Foreword



The papers contained in this volume were presented at the Tenth Annual
Symposium on Combinatorial Pattern Matching, held July 22-24, 1999 at the
University of Warwick, England. They were selected from 26 abstracts
submitted in response to the call for papers. In addition, invited lectures
were given by Joan Feigenbaum from AT&T Labs Research (Massive graphs:
algorithms, applications, and open problems) and David Jones from the
Department of Biology, University of Warwick (Optimizing biological sequences
and protein structures using simulated annealing and genetic algorithms).

The symposium was preceded by a two-day summer school set up to attract
and train young researchers. The lecturers of the school were Alberto
Apostolico (Computational Theories of Surprise), Joan Feigenbaum (Algorithmics
of network-generated massive data sets), Leszek Gąsieniec and Paul Goldberg
(The complexity of gene placement), David Jones (An introduction to
computational molecular biology), Arthur Lesk (Structural alignment and maximal
substructure extraction), Cenk Sahinalp (Quest for measuring distance between
strings: exact, approximate, and probabilistic algorithms), and Jim Storer.

Combinatorial Pattern Matching (CPM) addresses issues of searching and 
matching strings and more complicated patterns such as trees, regular expres- 
sions, graphs, point sets, and arrays. The goal is to derive non-trivial combina- 
torial properties of such structures and to exploit these properties in order to 
achieve superior performance for the corresponding computational problems. 

Over recent years, a steady flow of high-quality research on this subject has 
changed a sparse set of isolated results into a fully-fledged area of algorithmics. 
This area is continuing to grow even further due to the increasing demand for 
speed and efficiency that comes from important and rapidly expanding appli- 
cations such as the World Wide Web, computational biology, and multimedia 
systems, involving requirements for information retrieval, data compression, and 
pattern recognition. The objective of the annual CPM gatherings is to provide an
international forum for research in combinatorial pattern matching and related 
applications. 

The general organisation and orientation of CPM conferences is coordinated
by a steering committee composed of A. Apostolico, M. Crochemore, Z. Galil,
and U. Manber.

The first nine meetings were held in Paris (1990), London (1991), Tucson
(1992), Padova (1993), Asilomar (1994), Helsinki (1995), Laguna Beach (1996),
Aarhus (1997), and Piscataway (1998). After the first meeting, a selection of
papers appeared as a special issue of Theoretical Computer Science in volume
92. The proceedings of the third to ninth meetings appeared as volumes 644,
684, 807, 937, 1075, 1264, and 1448 of the present LNCS series at Springer.

CPM’99 was organised by Cenk Sahinalp of the Department of Computer Science
at Warwick University. The conference was supported in part by MATHFIT
(a joint programme of EPSRC and the London Mathematical Society).
Shift-And Approach to Pattern Matching
in LZW Compressed Text



Takuya Kida, Masayuki Takeda, Ayumi Shinohara, and Setsuo Arikawa 



Department of Informatics, Kyushu University 33 
Fukuoka 812-8581, Japan 

{kida, takeda, ayumi, arikawa}@i.kyushu-u.ac.jp



Abstract. This paper considers the Shift-And approach to the problem
of pattern matching in LZW compressed text, and gives a new algorithm
that solves it. The algorithm is indeed fast when the pattern length is
at most 32, or the word length. After an $O(m + |\Sigma|)$ time and
$O(|\Sigma|)$ space preprocessing of a pattern, it scans an LZW compressed
text in $O(n + r)$ time and reports all occurrences of the pattern, where $n$
is the compressed text length, $m$ is the pattern length, and $r$ is the number
of pattern occurrences. Experimental results show that it runs approximately
1.5 times faster than a decompression followed by a simple search
using the Shift-And algorithm. Moreover, the algorithm can be extended
to generalized pattern matching, to pattern matching with $k$
mismatches, and to multiple pattern matching, like the Shift-And
algorithm.



1 Introduction 

Pattern matching in compressed text is one of the most interesting topics in
combinatorial pattern matching. Several researchers have tackled this problem.
Eilam-Tzoreff and Vishkin [?] addressed the run-length compression, and Amir,
Landau, and Vishkin [?], Amir and Benson [?], and Amir, Benson, and Farach [?]
addressed its two-dimensional version. Farach and Thorup [?] and Gąsieniec,
et al. [?] addressed the LZ77 compression [?]. Amir, Benson, and Farach [?]
addressed the LZW compression [?]. Karpinski, et al. [?] and Miyazaki, et al. [?]
addressed the straight-line programs. However, it seems that most of these
studies were undertaken mainly from the theoretical viewpoint. Concerning the
practical aspect, Manber [?] pointed out at CPM’94 as follows.

It is not clear, for example, whether in practice the compressed search
in [?] will indeed be faster than a regular decompression followed by a
fast search.

In 1998 we gave in [?] an affirmative answer to the above question: we
presented an algorithm for finding multiple patterns in LZW compressed text,
which is a variant of the Amir-Benson-Farach algorithm [?], and showed that in
practice the algorithm is faster than a decompression followed by a simple search.



M. Crochemore, M. Paterson (Eds.); CPM’99, LNCS 1645, pp. 1-^21 1999. 
(c) Springer-Verlag Berlin Heidelberg 1999
Namely, it was proved that pattern matching in compressed text is not only of
theoretical interest but also of practical interest. We believe that fast pattern 
matching in compressed text is of great importance since there is a remarkable 
explosion of machine readable text files, which are often stored in compressed 
forms. 

On the other hand, the Shift-And approach [?] to the classical pattern
matching is widely known to be efficient in many practical applications. This
method is simple, but very fast when the pattern length is not greater than the
word length of typical computers, say 32. In this paper, we apply this method
to the problem of pattern matching in LZW compressed text and then give a
new algorithm that solves it. Let $m$, $n$, $r$ be the pattern length, the
length of compressed text, and the number of occurrences of the pattern in the
original text, respectively. The algorithm, after an $O(m + |\Sigma|)$ time and
$O(|\Sigma|)$ space preprocessing of a pattern, scans a compressed text in
$O(n + r)$ time using $O(n + m)$ space and reports all occurrences of the
pattern in the original text. The $O(r)$ time is devoted only to reporting the
pattern occurrences. Experimental results on the Brown corpus show that the
proposed algorithm is approximately 1.5 times faster than a decompression
followed by a search using the Shift-And method. Moreover, the algorithm can be
extended to (1) generalized pattern matching, (2) pattern matching with $k$
mismatches, and (3) multiple pattern matching.

We assume, throughout this paper, that $m \le 32$ and that the arithmetic
operations, the bitwise logical operations, and the logarithm operation on
integers can be performed in constant time.

The organization of this paper is as follows. In Section 2 we briefly sketch
the LZW compression method and the Shift-And pattern matching algorithm. We
present our algorithm and discuss its complexity in Section 3. In Section 4 we
show the experimental results in comparison with both an LZW decompression
followed by a search using the Shift-And method and the previous algorithm
presented in [?]. In Section 5 we discuss the extensions of the algorithm to
generalized pattern matching, to pattern matching with $k$ mismatches, and
to multiple pattern matching.



2 Preliminaries 

We first define some notation. Let $\Sigma$, usually called an alphabet, be a
finite set of characters, and $\Sigma^*$ be the set of strings over $\Sigma$.
We denote the length of $u \in \Sigma^*$ by $|u|$. The string of length 0 is
called the null string and is denoted by $\varepsilon$. We denote by $u[i]$ the
$i$-th character of a string $u$, and by $u[i:j]$ the string
$u[i]u[i+1] \ldots u[j]$, $1 \le i \le j \le |u|$. For a set $A$ of integers
and an integer $k$, let $A \oplus k = \{i + k \mid i \in A\}$ and
$k \ominus A = \{k - i \mid i \in A\}$.

In the following subsections we briefly sketch the LZW compression method 
and the Shift-And pattern matching algorithm. 
Original text:     a  b  ab  ab  ba  b  c  aba  bc  abab
Compressed text:   1, 2, 4, 4, 5, 2, 3, 6, 9, 11

Dictionary phrases (node number: phrase):
1: a, 2: b, 3: c, 4: ab, 5: ba, 6: aba, 7: abb, 8: bab, 9: bc, 10: ca,
11: abab, 12: bca

Fig. 1. Dictionary trie.



2.1 LZW Compression 

The LZW compression is a very popular compression method. It is adopted as
the compress command of UNIX, for instance. It parses a text into phrases and
replaces them with pointers to the dictionary. The dictionary initially consists
of the characters in $\Sigma$. The compression procedure repeatedly finds the
longest match at the current position and updates the dictionary by adding the
concatenation of the match and the next character. The dictionary is
implemented as a trie structure, in which each node represents a phrase. The
matches are encoded as integers associated with the corresponding nodes of the
dictionary trie. The update of the dictionary is executed in $O(1)$ time by
creating a new node labeled by the next character as a child of the node
corresponding to the current match.

Figure 1 shows the dictionary trie for the text abababbabcababcabab, assuming
the alphabet $\Sigma = \{a, b, c\}$. Hereafter, we identify the string $u$ with
the integer representing it, if no confusion occurs.
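
The parsing described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the paper: the function name `lzw_compress` is hypothetical, codes are numbered from 1 in alphabet order as in Figure 1, the text is assumed to contain only alphabet characters, and a hash map stands in for the dictionary trie.

```python
def lzw_compress(text, alphabet):
    """Return the list of LZW codes for `text` (codes start at 1)."""
    # The dictionary initially holds the single characters of the alphabet.
    codes = {ch: i + 1 for i, ch in enumerate(alphabet)}
    out = []
    phrase = ""
    for ch in text:
        if phrase + ch in codes:
            phrase += ch                          # keep extending the current match
        else:
            out.append(codes[phrase])             # emit the longest match
            codes[phrase + ch] = len(codes) + 1   # add match + next character
            phrase = ch
    if phrase:
        out.append(codes[phrase])
    return out


print(lzw_compress("abababbabcababcabab", "abc"))
# prints [1, 2, 4, 4, 5, 2, 3, 6, 9, 11]
```

On the text of Figure 1 the sketch reproduces the compressed sequence shown there.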

The dictionary trie is removed after the compression is completed. It can be
reconstructed from the compressed text. In the decompression, the original text
is obtained with the aid of the recovered dictionary trie. This decompression
takes linear time proportional to the length of the original text. However, if
the original text is not required, the dictionary trie can be built in only
$O(n)$ time, where $n$ is the length of the compressed text. The algorithm for
constructing the dictionary trie from a compressed text is summarized in
Figure 2.
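
As a hedged sketch of this reconstruction, the Python code below rebuilds the dictionary from the codes alone. Each node stores its parent, its last character, its first character, and its length, so every step costs constant time and the original text is never produced. The names `rebuild_dictionary` and `phrase` are illustrative assumptions, and a flat table replaces the trie.

```python
def rebuild_dictionary(codes, alphabet):
    """Map each code to (parent, last_char, first_char, length); 0 means no parent."""
    nodes = {i + 1: (0, ch, ch, 1) for i, ch in enumerate(alphabet)}
    for cur, nxt in zip(codes, codes[1:]):
        # If `nxt` is the node being created right now, it starts with first(cur).
        first = nodes[nxt][2] if nxt in nodes else nodes[cur][2]
        nodes[len(nodes) + 1] = (cur, first, nodes[cur][2], nodes[cur][3] + 1)
    return nodes


def phrase(nodes, code):
    """Spell out one phrase by walking parent links (used here only for checking)."""
    chars = []
    while code:
        parent, ch, _, _ = nodes[code]
        chars.append(ch)
        code = parent
    return "".join(reversed(chars))


codes = [1, 2, 4, 4, 5, 2, 3, 6, 9, 11]
nodes = rebuild_dictionary(codes, "abc")
print("".join(phrase(nodes, c) for c in codes))
# prints abababbabcababcabab
```

The `phrase` helper is only there to check the result; the matching algorithm itself never needs to spell a phrase out.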

2.2 The Shift-And Pattern Matching Algorithm 

The Shift-And pattern matching algorithm was proposed by Abrahamson [?],
Baeza-Yates and Gonnet [?], and Wu and Manber [?]. In the following, we
present the algorithm according to the notation in [?].

Let $P = P[1:m]$ be a pattern of length $m$, and $T = T[1:N]$ be a text of
length $N$. For $k = 0, 1, \ldots, N$, let

$$R_k = \{1 \le i \le m \mid i \le k \text{ and } P[1:i] = T[k-i+1:k]\}, \tag{1}$$
Input. An LZW compressed text u_1 u_2 ... u_n.
Output. Dictionary D represented in the form of trie.
Method.
begin
    D := Σ;
    for i := 1 to n − 1 do begin
        if u_{i+1} ≤ |D| then
            let a be the first character of u_{i+1}
        else
            let a be the first character of u_i;
        D := D ∪ {u_i · a}
    end
end.

Fig. 2. Reconstruction of dictionary trie.



and for any $a \in \Sigma$, let

$$M(a) = \{1 \le i \le m \mid P[i] = a\}. \tag{2}$$

Definition 1. Define the function
$f : 2^{\{1,\ldots,m\}} \times \Sigma \to 2^{\{1,\ldots,m\}}$ by

$$f(S, a) = ((S \oplus 1) \cup \{1\}) \cap M(a),$$

where $S \subseteq \{1, \ldots, m\}$ and $a \in \Sigma$.

Using this function we can compute the values of $R_k$ for $k = 1, 2, \ldots$ by

1. $R_0 = \emptyset$,

2. $R_{k+1} = f(R_k, T[k+1])$ $(k \ge 0)$.

For $k = 1, 2, \ldots, N$, the algorithm reads the $k$-th character of the text,
computes the value of $R_k$, and then examines whether $m$ is in $R_k$. If
$m \in R_k$, then $T[k-m+1:k] = P$, that is, there is a pattern occurrence at
position $k - m + 1$ of the text. Note that we can regard $R_k$ as states of
the KMP automaton, and $f$ acts as the state transition function.

When $m \le 32$, we can represent the sets $R_k$ and $M(a)$ as $m$-bit integers.
Then, we can calculate the integers $R_k$ by

1. $R_0 = 0$,

2. $R_{k+1} = ((R_k \ll 1) + 1) \mathbin{\&} M(T[k+1])$ $(k \ge 0)$,

where '$\ll$' and '&' denote the bit-shift operation and the bitwise logical
product, respectively. We can get a pattern occurrence if
$R_k \mathbin{\&} 2^{m-1} \ne 0$. For example, the values of $R_k$ for
$k = 0, 1, \ldots$ are shown in Figure 3, where $T$ = abababbabcababc
and $P$ = ababc.
The time complexity of this algorithm is $O(mN)$. However, the bitwise logical
product, the bit-shift, and the arithmetic operations on 32 bit integers can be
performed at high speed, and thus be considered to be done in $O(1)$ time. Then
we can regard the time complexity as $O(N)$ if $m$ is at most 32 (in fact such a
case occurs very often).
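
The recurrence above translates almost directly into code. The following Python sketch is illustrative only: the function name `shift_and` is an assumption, bit $i-1$ of an integer stands for pattern position $i$, the pattern is assumed non-empty, and Python's unbounded integers mean that the restriction $m \le 32$ is not enforced here.

```python
def shift_and(pattern, text):
    """Return the 1-based start positions of `pattern` in `text`."""
    m = len(pattern)
    mask = {}                                    # the table M(a) as bit masks
    for i, ch in enumerate(pattern):
        mask[ch] = mask.get(ch, 0) | (1 << i)
    accept = 1 << (m - 1)                        # bit for "m is in R_k"
    r = 0                                        # R_0 = 0
    hits = []
    for k, ch in enumerate(text, start=1):
        r = ((r << 1) | 1) & mask.get(ch, 0)     # R_k from R_{k-1} and T[k]
        if r & accept:
            hits.append(k - m + 1)
    return hits


print(shift_and("ababc", "abababbabcababc"))
# prints [11]
```

The single reported position agrees with the occurrence marked in Figure 3, which ends at the last text character.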



original text:       a b a b a b b a b c a b a b c

          a        0 1 0 1 0 1 0 0 1 0 0 1 0 1 0 0
          b        0 0 1 0 1 0 1 0 0 1 0 0 1 0 1 0
R_k:      a        0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0
          b        0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0
          c        0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
                                                 △

Fig. 3. Behavior of the Shift-And algorithm.
The symbol △ indicates that a pattern occurrence is found at that
position.



3 Proposed Algorithm 

We want to design a new pattern matching algorithm that runs on an LZW
compressed text and simulates the behavior of the Shift-And algorithm on the
original text. Assume that the text is parsed as $u_1 u_2 \ldots u_n$. Let
$k_i = |u_1 u_2 \ldots u_i|$ for $i = 0, 1, \ldots, n$. Our idea is to compute
only the values of $R_{k_i}$ for $i = 1, 2, \ldots, n$, to achieve a linear time
complexity which is proportional not to the original text length $N$ but to the
compressed text length $n$.

Definition 2. Let $\hat{f}$ be the function $f$ extended to
$2^{\{1,\ldots,m\}} \times \Sigma^*$ by

$$\hat{f}(S, \varepsilon) = S \quad \text{and} \quad \hat{f}(S, ua) = f(\hat{f}(S, u), a),$$

where $S \subseteq \{1, \ldots, m\}$, $u \in \Sigma^*$ and $a \in \Sigma$.

Lemma 1. Suppose that the text is $T = xuy$ with $x, u, y \in \Sigma^*$ and
$u \ne \varepsilon$. Then,

$$R_{|xu|} = \hat{f}(R_{|x|}, u).$$

Proof. It follows directly from the definition of $\hat{f}$. ■

Let $D$ be the set of phrases in the dictionary. If we have the values of
$\hat{f}$ for the domain $2^{\{1,\ldots,m\}} \times D$, we can compute the value
$R_{k_{i+1}} = \hat{f}(R_{k_i}, u_{i+1})$ from $R_{k_i}$ and $u_{i+1}$ for each
$i = 0, 1, \ldots, n - 1$. As shown later, we can perform the computation in
only $O(1)$ time by executing the bit-shift and the bitwise logical operations,
using the function $\hat{M}$ defined as follows.
Definition 3. For any $u \in \Sigma^*$, let
$\hat{M}(u) = \hat{f}(\{1, \ldots, m\}, u)$.

Lemma 2. For any $S \subseteq \{1, \ldots, m\}$ and any $u \in \Sigma^*$,

$$\hat{f}(S, u) = ((S \oplus |u|) \cup \{1, 2, \ldots, |u|\}) \cap \hat{M}(u).$$

Proof. By induction on $|u|$. It is easy for $u = \varepsilon$. Suppose
$u = u'a$ with $u' \in \Sigma^*$ and $a \in \Sigma$. We have, from the induction
hypothesis,

$$\hat{f}(S, u') = ((S \oplus |u'|) \cup \{1, 2, \ldots, |u'|\}) \cap \hat{M}(u').$$

It follows from the definition of $f$ that, for any
$S_1, S_2 \subseteq \{1, 2, \ldots, m\}$ and for any $a \in \Sigma$,
$f(S_1 \cap S_2, a) = f(S_1, a) \cap f(S_2, a)$ and
$f(S_1 \cup S_2, a) = f(S_1, a) \cup f(S_2, a)$. Then,

$$\hat{f}(S, u) = (f(S \oplus |u'|, a) \cup f(\{1, 2, \ldots, |u'|\}, a)) \cap f(\hat{M}(u'), a)$$
$$= ((S \oplus |u|) \cup \{1, 2, \ldots, |u|\}) \cap \hat{M}(u). \quad ■$$
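
Lemma 2 can be checked numerically with bit masks. The Python sketch below is an illustration under stated assumptions, not part of the paper: `make_masks`, `f_hat`, and `jump` are hypothetical names, sets are encoded with bit $i-1$ for element $i$, and all sets are truncated to $\{1, \ldots, m\}$.

```python
def make_masks(pattern):
    """Build the table M(a) as bit masks."""
    mask = {}
    for i, ch in enumerate(pattern):
        mask[ch] = mask.get(ch, 0) | (1 << i)
    return mask


def f_hat(s, u, mask):
    """Definition 2: apply f once per character of u."""
    for ch in u:
        s = ((s << 1) | 1) & mask.get(ch, 0)
    return s


def jump(s, u, mask, m):
    """Lemma 2: one shift, one union, and one intersection with M_hat(u)."""
    full = (1 << m) - 1
    m_hat = f_hat(full, u, mask)            # Definition 3
    shifted = (s << len(u)) & full          # S (+) |u|, truncated to m bits
    prefix = (1 << min(len(u), m)) - 1      # {1, ..., |u|} within {1, ..., m}
    return (shifted | prefix) & m_hat


mask = make_masks("ababc")
ok = all(jump(s, u, mask, 5) == f_hat(s, u, mask)
         for s in range(32)
         for u in ["", "a", "ab", "abab", "ababc", "bc", "cab"])
print(ok)
# prints True
```

In the algorithm itself $\hat{M}(u)$ is of course not recomputed for every jump; it is stored once per dictionary node, as Lemma 3 below states.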



Lemma 3. The function which takes as input $u \in D$ and returns in $O(1)$ time
the $m$-bit representation of the set $\hat{M}(u)$, can be realized in
$O(|D| + m)$ time using $O(|D|)$ space.

Proof. Since $\hat{M}(u) \subseteq \{1, \ldots, m\}$, we can store $\hat{M}(u)$
as an $m$-bit integer in the node $u$ of the dictionary trie $D$. Suppose
$u = u'a$ with $u' \in D$ and $a \in \Sigma$. $\hat{M}(u)$ can be computed in
$O(1)$ time from $\hat{M}(u')$ and $M(a)$ when the node $u$ is added to the
dictionary trie, since
$\hat{M}(u) = f(\hat{M}(u'), a) = ((\hat{M}(u') \oplus 1) \cup \{1\}) \cap M(a)$.

Since the table $M(a)$ is computed in $O(|\Sigma| + m)$ time using
$O(|\Sigma|)$ space and $\Sigma \subseteq D$, the total time and space
complexities are $O(|D| + m)$ and $O(|D|)$, respectively. ■

Now we have the following theorem from Lemmas 1, 2, and 3.

Theorem 1. The function which takes as input
$(S, u) \in 2^{\{1,\ldots,m\}} \times D$ and returns in $O(1)$ time the $m$-bit
representation of the set $\hat{f}(S, u)$, can be realized in $O(|D| + m)$ time
using $O(|D|)$ space.

Since $|D| = O(n)$, we can perform in $O(n + m)$ time the computation of
$R_{k_i}$ for $i = 1, \ldots, n$ by executing the bit-shift and the bitwise
logical operations. However, we have to examine whether $m \in R_j$ for every
$j = 1, 2, \ldots, N$. For a complete simulation of the move of the Shift-And
algorithm, we need a mechanism for enumerating the set
$Output(R_{k_i}, u_{i+1})$ defined as follows.

Definition 4. For $S \subseteq \{1, \ldots, m\}$ and $u \in D$, let

$$Output(S, u) = \{1 \le i \le |u| \mid m \in \hat{f}(S, u[1:i])\}.$$
To realize the procedure enumerating the set $Output$, we define the following
sets.

Definition 5. For any $u \in D$, let

$$U(u) = \{1 \le i \le |u| \mid i < m \text{ and } m \in \hat{M}(u[1:i])\}, \text{ and}$$
$$V(u) = \{1 \le i \le |u| \mid i \ge m \text{ and } m \in \hat{M}(u[1:i])\}.$$

Then, we have the following lemma.

Lemma 4. For any $S \subseteq \{1, \ldots, m\}$ and any $u \in \Sigma^*$,

$$Output(S, u) = ((m \ominus S) \cap U(u)) \cup V(u).$$

Proof. By Lemma 2 and Definitions 4 and 5, we obtain:

$$Output(S, u) = \{1 \le i \le |u| \mid i < m \text{ and } m \in (S \oplus i) \cap \hat{M}(u[1:i])\}$$
$$\cup \; \{1 \le i \le |u| \mid m \le i \text{ and } m \in \hat{M}(u[1:i])\}$$
$$= ((m \ominus S) \cap U(u)) \cup V(u). \quad ■$$

Since $U(u) \subseteq \{1, \ldots, m\}$, we can store the set $U(u)$ as an
$m$-bit integer in the node $u$ of the dictionary trie $D$.

Lemma 5. The function which takes as input $u \in D$ and returns in $O(1)$ time
the $m$-bit representation of $U(u)$, can be realized in $O(|D| + m)$ time using
$O(|D|)$ space.

Proof. By the definition of $U$, for any $u = u'a$ with $u' \in \Sigma^*$ and
$a \in \Sigma$,

$$U(u) = U(u') \cup \{|u| \mid |u| < m \text{ and } m \in \hat{M}(u)\}.$$

Then, we can prove the lemma in a similar way to the proof of Lemma 3. ■

To eliminate the cost of performing the operation $\ominus$ in
$(m \ominus S) \cap U(u)$, we store the set $U'(u) = m \ominus U(u)$ instead of
$U(u)$. Then, we can obtain the integer representing the set $S \cap U'(u)$ by
one execution of the bitwise logical product operation. For an enumeration of
the set, we repeatedly use the logarithm operation to find the leftmost bit of
the integer that is one. Assuming that the logarithm operation can be performed
in constant time, this enumeration takes only linear time proportional to the
set size.
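
The enumeration step can be sketched as follows. In Python, `int.bit_length()` plays the role of the constant-time logarithm operation assumed above; the function name `elements_from_left` is a hypothetical choice, and bit $i-1$ again encodes element $i$.

```python
def elements_from_left(x):
    """Yield the elements of the set encoded by x, largest first."""
    while x:
        top = x.bit_length()      # equals floor(log2(x)) + 1
        yield top
        x ^= 1 << (top - 1)       # clear the leftmost one bit


print(list(elements_from_left(0b10110)))
# prints [5, 3, 2]
```

Each iteration removes exactly one bit, so the loop runs once per element of the set.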

Next, we consider $V(u)$. Since the set $V(u)$ cannot be represented as an
$m$-bit integer, we shall represent it as a linked list as shown in the proof
of the next lemma.

Lemma 6. The procedure which takes as input $u \in D$ and enumerates the set
$V(u)$, can be realized in $O(|D| + m)$ time using $O(|D|)$ space, so that it
runs in linear time with respect to $|V(u)|$.






Proof. By the definition of $V$, for any $u = u'a$ with $u' \in \Sigma^*$ and
$a \in \Sigma$,

$$V(u) = V(u') \cup \{|u| \mid m \le |u| \text{ and } m \in \hat{M}(u)\}.$$

We use the function $Prev(u)$ that returns the node of the dictionary trie $D$
that represents the longest proper prefix $v$ of $u$ such that $|v| \in V(u)$.
Then, we have

$$V(u) = V(Prev(u)) \cup \{|u| \mid m \le |u| \text{ and } m \in \hat{M}(u)\}.$$

The function $Prev(u)$ can be realized to answer in $O(1)$ time, using
$O(|D|)$ time and space. Therefore it is sufficient to store in every node $u$
of the dictionary trie $D$ the value $Prev(u)$ and the boolean value
$in\_V(u)$ indicating whether $|u| \in V(u)$. The proof is now complete. ■

From Lemmas 4, 5, and 6, we have the following theorem.

Theorem 2. The procedure which takes as input
$(S, u) \in 2^{\{1,\ldots,m\}} \times D$ and enumerates the set $Output(S, u)$,
can be realized in $O(|D| + m)$ time using $O(|D|)$ space, so that it runs in
linear time with respect to $|Output(S, u)|$.

Now we can simulate the behavior of the Shift-And algorithm on an
uncompressed text completely. The algorithm is summarized in Figure 4. The
behavior of the new algorithm is illustrated in Figure 5.

Theorem 3. The algorithm of Figure 4 runs in $O(|\Sigma| + m + n + r)$ time
using $O(|\Sigma| + m + n)$ space, where $r$ is the number of pattern
occurrences.



4 Experimental Results 



In order to estimate the performance of the proposed algorithm, we carried out 
some experiments on the following four methods. 



Method 1. A decompression followed by the Shift-And algorithm.
Method 2. Our previous algorithm presented in [?].
Method 3. The new algorithm proposed in this paper.
Method 4. Searching the uncompressed text, using the Shift-And algorithm.



In our experiments we used the Brown corpus as the text to be searched. The 
uncompressed size is about 6.8Mb and the compressed size is about 3.4Mb. The 
experiments were performed in the following two different situations. 



Situation 1. Workstation (SPARCstation 20) with remote disk storage. The
file transfer rate is 0.96 Mbyte/sec.

Situation 2. Workstation (SPARCstation 20) with local disk storage. The file
transfer rate is 3.27 Mbyte/sec.
Input. An LZW compressed text u_1 u_2 ... u_n and a pattern P.
Output. All positions at which P occurs.

begin
    /* We represent the set V(u) by the functions Prev(u) and in_V(u).
       See the proof of Lemma 6. */

    /* Preprocessing */
    Construct the table M from P;
    D := ∅; U'(ε) := ∅; in_V(ε) := false; Prev(ε) := ε;
    for each a ∈ Σ do call Update(ε, a);

    /* Text scanning */
    k := 0; R := ∅;
    for ℓ := 1 to n do begin
        call Update(u_{ℓ−1}, u_ℓ);    /* We assume u_0 = ε. */
        for each p ∈ (m ⊖ (R ∩ U'(u_ℓ))) ∪ V(u_ℓ) do
            report a pattern occurrence at position k + p − m + 1;
        R := ((R ⊕ |u_ℓ|) ∪ {1, 2, ..., |u_ℓ|}) ∩ M̂(u_ℓ);
        k := k + |u_ℓ|
    end
end.

procedure Update(u, v)
begin
    if v ≤ |D| then
        let a be the first character of v
    else
        let a be the first character of u;
    D := D ∪ {u · a};
    M̂(u · a) := ((M̂(u) ⊕ 1) ∪ {1}) ∩ M(a);
    if |u · a| < m then
        if m ∈ M̂(u · a) then
            U'(u · a) := U'(u) ∪ {m − |u · a|}
        else
            U'(u · a) := U'(u)
    else begin
        U'(u · a) := U'(u);
        if m ∈ M̂(u · a) then
            in_V(u · a) := true
        else
            in_V(u · a) := false;
        if in_V(u) = true then
            Prev(u · a) := u
        else
            Prev(u · a) := Prev(u)
    end
end;

Fig. 4. Pattern matching algorithm in LZW compressed text
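
The procedure of Figure 4 can be sketched in Python as follows. This is a hedged illustration rather than the authors' implementation: the name `search_lzw` and the per-node lists are assumptions, the trie is replaced by tables indexed by code (node 0 is the empty phrase), $U'(u)$ is inherited unchanged once a phrase reaches length $m$ as Lemma 5 requires, occurrences are sorted at the end for readability, and Python's unbounded integers stand in for $m$-bit machine words. The pattern is assumed non-empty.

```python
def search_lzw(codes, pattern, alphabet):
    """Return sorted 1-based start positions of `pattern` in the LZW-coded text."""
    m = len(pattern)
    full, top = (1 << m) - 1, 1 << (m - 1)
    mask = {a: 0 for a in alphabet}                 # the table M
    for i, ch in enumerate(pattern):
        if ch in mask:
            mask[ch] |= 1 << i
    # Per-node tables indexed by code; node 0 is the empty phrase.
    first, length, m_hat = [None], [0], [full]
    u_rev, in_v, prev = [0], [False], [0]           # U'(u), in_V(u), Prev(u)

    def update(parent, a):
        n = length[parent] + 1
        mh = ((m_hat[parent] << 1) | 1) & mask[a]   # M_hat(u.a) from M_hat(u)
        hit = bool(mh & top)
        first.append(a if parent == 0 else first[parent])
        length.append(n)
        m_hat.append(mh)
        # U'(u.a): add m - n when a prefix shorter than m completes the pattern.
        u_rev.append(u_rev[parent] | (1 << (m - n - 1)) if hit and n < m
                     else u_rev[parent])
        in_v.append(hit and n >= m)
        prev.append(parent if in_v[parent] else prev[parent])

    for a in alphabet:
        update(0, a)
    hits, k, r, last = [], 0, 0, None
    for code in codes:
        if last is not None:
            # New phrase: previous phrase plus the first character of this one.
            update(last, first[code] if code < len(first) else first[last])
        x = r & u_rev[code]                         # ends at an offset below m
        while x:
            j = x.bit_length()
            hits.append(k - j + 1)
            x ^= 1 << (j - 1)
        node = code                                 # ends at an offset of at least m
        while node:
            if in_v[node]:
                hits.append(k + length[node] - m + 1)
            node = prev[node]
        s = min(length[code], m)
        r = (((r << s) & full) | ((1 << s) - 1)) & m_hat[code]
        k += length[code]
        last = code
    return sorted(hits)


print(search_lzw([1, 2, 4, 4, 5, 2, 3, 6, 9], "ababc", "abc"))
# prints [11]
```

The input codes are those of the text abababbabcababc, and the single occurrence ends inside the last phrase, as in the run shown in Figure 5.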
original text:          a    b    ab   ab   ba   b    c    aba  bc
compressed text:        1    2    4    4    5    2    3    6    9

          a        0    1    0    0    0    1    0    0    1    0
          b        0    0    1    1    1    0    1    0    0    0
R_k:      a        0    0    0    0    0    0    0    0    1    0
          b        0    0    0    1    1    0    0    0    0    0
          c        0    0    0    0    0    0    0    0    0    1

Output(R_k, u_i):                                               {2}

Fig. 5. Behavior of the algorithm.



Table 1. CPU time and elapsed time.

method      CPU time (sec)    elapsed time (sec)
                              Situation 1    Situation 2
Method 1         7.52             8.16           7.62
Method 2         6.57             7.31           6.83
Method 3         5.15             6.05           5.41
Method 4         3.09             9.36           3.25


The searching times, measured in both the CPU time and the elapsed time, 
are shown in Table 1, where we included the preprocessing time. 

Although the time complexities of our algorithms are linear with respect to
the compressed text size $n$, not to the original size $N$, the LZW compression
of typical English texts normally gives $n = N/2$ and thus the constant factor
is crucial. It is observed from Table 1 that, in the CPU time comparison, our
algorithms (Methods 2 and 3) are slower than the uncompressed case (Method 4)
whereas they are faster than a decompression followed by a search (Method 1).
It is also observed that the new algorithm (Method 3) is about 1.3 times faster
than the previous one (Method 2).

In general, the searching time is the sum of (1) the file I/O time and (2) 
the CPU time consumed for compressed pattern matching. Text compression 
reduces the file I/O time at the same ratio as the compression ratio while it may 
increase the CPU time. When the data transfer is slow, we have to give a weight 
to the reduction of the file I/O time, and a good compression ratio leads to a fast 
search. In fact, even a decompression followed by a simple search (Method 1) 
was faster than the uncompressed search (Method 4) in Situation 1. It should 
be noted that, in this situation, the previous algorithm (Method 2) and the new 
algorithm (Method 3) are faster than the uncompressed case (Method 4), and 
especially the latter is approximately 1.5 times faster than the uncompressed 
case. 

On the contrary, in situations where the data transfer is relatively fast, the
CPU time becomes a dominant factor. It is observed that, as in the CPU time
comparison, Methods 2 and 3 are slower than Method 4 while they are faster
than Method 1 in the elapsed time comparison in Situation 2.
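
The speed-up factors quoted in this section follow directly from Table 1. The short Python sketch below only redoes that arithmetic; the dictionary layout and the helper name `ratio` are illustrative assumptions, and the figures are the ones reported in the table.

```python
# (CPU time, elapsed time in Situation 1, elapsed time in Situation 2), in seconds
table1 = {
    "Method 1": (7.52, 8.16, 7.62),
    "Method 2": (6.57, 7.31, 6.83),
    "Method 3": (5.15, 6.05, 5.41),
    "Method 4": (3.09, 9.36, 3.25),
}


def ratio(slow, fast, column):
    """How many times faster `fast` is than `slow` in the given column."""
    return table1[slow][column] / table1[fast][column]


print(f"CPU, Method 1 vs Method 3:         {ratio('Method 1', 'Method 3', 0):.2f}")
print(f"CPU, Method 2 vs Method 3:         {ratio('Method 2', 'Method 3', 0):.2f}")
print(f"Situation 1, Method 4 vs Method 3: {ratio('Method 4', 'Method 3', 1):.2f}")
```

The three ratios come out near 1.46, 1.28, and 1.55, which match the rounded factors of 1.5, 1.3, and 1.5 stated in the text.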

Thus we conclude that, for the LZW compression, the compressed search is 
indeed faster than a decompression followed by a fast search, and that the Shift- 
And approach is effective in the LZW compressed pattern matching. When the 
data transfer is slow, e.g. network environments, the compressed search can be 
faster than the uncompressed search. 

5 Extensions 

In this section, we mention how to extend our algorithm. 

5.1 Generalized Pattern Matching 

The generalized pattern matching problem [?] is a pattern matching problem in
which a pattern element is a set of characters. For instance, (b + c + h + l)ook
is a pattern that matches the strings book, cook, hook, and look. Formally, let
$\Delta = \{X \subseteq \Sigma \mid X \ne \emptyset\}$ and
$\mathcal{P} = X_1 \ldots X_m$ $(X_i \in \Delta)$. Then we want to find all
integers $i$ such that $T[i : i + m - 1] \in \mathcal{P}$.

It is not difficult to extend our algorithm to the problem. We have only
to modify some equations: for example, we modify Equations (1) and (2) in
Section 2 as follows.

$$R_k = \{1 \le i \le m \mid \mathcal{P}[1:i] \ni T[k-i+1:k]\},$$
$$M(a) = \{1 \le i \le m \mid \mathcal{P}[i] \ni a\}.$$

5.2 Pattern Matching with k Mismatches

This problem is a pattern matching problem in which we allow up to $k$
characters of the pattern to mismatch with the corresponding text [?]. For
example, if $k = 2$, the pattern pattern matches the strings postern and
cittern, but does not match eastern. The idea stated in [?] to solve this
problem is to count up the number of mismatches using
$\lceil m \log_2 m \rceil$ bits instead of using one bit to see whether
$\mathcal{P}[i] = T[k]$. This technique can be used to adapt our algorithm for
the problem.

5.3 Multiple Pattern Matching

Suppose we are looking for multiple patterns in a text. One solution is to keep
one bit vector $R$ per pattern and perform the Shift-And algorithm in parallel,
but the time complexity is linearly proportional to the number of patterns. The
solutions in [?] are to coalesce all vectors, keeping all the information
in only one vector. Such a technique can be used to adapt our algorithm for the
multiple pattern matching problem in LZW compressed text.
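
For the generalized pattern matching of Section 5.1, only the table $M$ changes: position $i$ is recorded for every character in the class $X_i$. The Python sketch below shows this on uncompressed text, as a minimal illustration; the names `class_masks` and `search_classes` and the sample sentence are assumptions made for the example.

```python
def class_masks(classes):
    """classes[i] holds the characters allowed at pattern position i + 1."""
    mask = {}
    for i, allowed in enumerate(classes):
        for ch in allowed:
            mask[ch] = mask.get(ch, 0) | (1 << i)   # M(a) = {i : a in X_i}
    return mask


def search_classes(classes, text):
    """Return the 1-based start positions of matches of the class pattern."""
    m = len(classes)
    mask = class_masks(classes)
    r, hits = 0, []
    for k, ch in enumerate(text, start=1):
        r = ((r << 1) | 1) & mask.get(ch, 0)
        if r & (1 << (m - 1)):
            hits.append(k - m + 1)
    return hits


# The pattern (b + c + h + l)ook from Section 5.1.
print(search_classes(["bchl", "o", "o", "k"], "a cook took the book"))
# prints [3, 17]
```

The scanning loop is the unmodified Shift-And recurrence, which is why the same change of table carries over to the compressed algorithm.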
6 Conclusion

In this paper we addressed the problem of searching in LZW compressed text
directly, and presented a new algorithm. We implemented the algorithm, and
showed that it is approximately 1.5 times faster than a decompression followed by
a search using the Shift-And algorithm. Moreover, we showed that our algorithm
has several extensions, and is therefore useful in many practical applications.
Some future directions of this study will be extensions to the pattern matching 
with k differences, and to the regular expression matching, and will be to develop 
a compression method which enables us to scan compressed texts faster. 
Abstract. We address the problem of string matching on Ziv-Lempel
compressed text. The goal is to search a pattern in a text without un- 
compressing it. This is a highly relevant issue to keep compressed text 
databases where efficient searching is still possible. We develop a gen- 
eral technique for string matching when the text comes as a sequence of 
blocks. This abstracts the essential features of Ziv-Lempel compression. 
We then apply the scheme to each particular type of compression. We 
present the first algorithm to find all the matches of a pattern in a text 
compressed using LZ77. When we apply our scheme to LZ78, we obtain 
a much more efficient search algorithm, which is faster than uncompress- 
ing the text and then searching on it. Finally, we propose a new hybrid 
compression scheme which is between LZ77 and LZ78, being in practice 
as good to compress as LZ77 and as fast to search in as LZ78. 



1 Introduction 

String matching is one of the most pervasive problems in computer science, with
applications in virtually every area. It is also one of the oldest and richest
areas of development. The string matching problem is: given a pattern
$P = p_1 \ldots p_m$ and a text $T = t_1 \ldots t_u$, both sequences of symbols
over a finite alphabet $\Sigma$ of size $\sigma$, find all the occurrences of
$P$ in $T$. There are many algorithms to solve this problem, from classical to
very recent [?]. The complexity of this problem is $O(u)$ in the worst case and
$O(u \log(m)/m)$ on average, where $u = |T|$ and $m = |P|$, and there exist
variants of [?] which achieve this complexity. In practice, however, [?] are
the fastest algorithms in most cases.

Another old and rich area in computer science is text compression. Its aim is
to exploit the redundancies of the text to reduce its space usage. There are many
different compression schemes [?], among which the Ziv-Lempel family [?]
is one of the best in practice because of its good compression ratios combined
with efficient compression and decompression times. Other compression schemes
are Huffman coding [?] and arithmetic coding [?], among others.



M. Crochemore, M. Paterson (Eds.): CPM’99, LNCS 1645, pp. 14-|2^ 1999. 
(c) Springer-Verlag Berlin Heidelberg 1999
Today’s textual databases are an excellent example of applications where
both problems are crucial: the texts should be kept compressed to save space
and I/O time, and they should be efficiently searched. Surprisingly, these two
combined requirements are not easy to achieve together, as the only solution
before the 1990s was to process queries by uncompressing the texts and then
searching in them.

The compressed matching problem was first defined by Amir and Benson [?]
as the task of performing string matching in a compressed text without
decompressing it. Given a text $T$, a corresponding compressed string
$Z = z_1 \ldots z_n$, and a pattern $P$, the compressed matching problem
consists in finding all occurrences of $P$ in $T$, using only $P$ and $Z$. A
naive algorithm, which first decompresses the string $Z$ and then performs
standard string matching, takes time $O(u + m)$. An optimal algorithm takes
worst-case time $O(n + m)$, where $n = |Z|$. In [?] a new criterion, called
extra space, for evaluating compressed matching algorithms, was introduced.
According to the extra space criterion, algorithms should use at most $O(n)$
extra space, optimally $O(m)$ in addition to the $n$-length compressed file.

We define now a variation where we are required to report all the matching
positions. That is, given $P$ and $Z$, report all the $|x|$ such that
$T = xPy$. The optimal algorithm for this problem takes $O(m + n + R)$ time,
where $R$ is the number of matches.

Two different approaches have emerged in recent years to combine compression
and searching in textual databases. The first is strongly oriented to
natural language texts, which are assumed to be composed of words which follow
some statistical rules. The basic idea is to compress the text using Huffman
coding, where the words instead of the characters are taken as the symbols [?].
As Huffman coding assigns a fixed code to each symbol, searching for a given
string is a matter of compressing it and searching for it in the compressed
text using a classical string matching algorithm with minor modifications.
Despite its simplicity, this approach is very effective on natural language
text, with better compression ratios than those of the Ziv-Lempel family, and
search time which is between 2 and 8 times faster than the fastest algorithms
for standard string matching over the uncompressed text. These methods are also
able to search for complex patterns (such as regular expressions) and allow
errors in the matches, provided that words are matched against words. The
average search time for a simple pattern is close to
$O(m + n \log(u/n)/(u/n))$. The extra space is $O(\sqrt{u})$, which is the same
space necessary to decompress the text. A weakness of this scheme is that it
does not work well on small texts (say, less than 10 Mb), since in that case the
vocabulary is almost as big as the text itself. Also, it can be applied only to
natural language texts.

Another practical approach is an ad-hoc technique [?], which however is
not so fast, obtains compression ratios of near 70% (against 30% to 40% of
Ziv-Lempel algorithms), and relies on the ASCII encoding.

The second line of research considers Ziv-Lempel compression, which is based
on finding repetitions in the text and replacing them with references to similar
strings that appeared previously. LZ77 is able to reference any substring of the
text already processed, while LZ78 references only a single previous reference plus
a new letter that is added. In both cases, the referenced text to be found is
normally limited by a window which precedes the current text position.

String matching in Ziv-Lempel compressed texts is much more complex, since
the pattern can appear in different forms across the compressed text. In [2] a
compressed matching algorithm for LZ78 is presented, which works in time and
space O(m² + n). For LZ77, the only result is El, which is a randomized algorithm
to determine in time O(m + n log²(u/n)) whether a pattern is present or
not in an LZ77-compressed text, but it does not find all the pattern occurrences.
Other algorithms for different specific search problems have been presented in
I3IT7]. This second branch is rather theoretical and, to the best of our knowledge,
no actual implementations have been developed.

In this paper we aim at efficient algorithms for string matching on Ziv-Lempel
compressed texts. We present new theoretical developments but also give practical
implementations and experiments on our algorithms. Our main results are:

— We develop a general technique for string matching on a text which is given 
as a sequence of blocks. This abstracts the essential features of Ziv-Lempel 
compressed texts and is the basis for the algorithms which run over specific 
members of the family. 

— We apply our technique to LZ77-compressed texts. The result is the first
algorithm to search under this compression scheme (recall that the earlier
randomized result cannot find all the occurrences of the pattern). The
algorithm, however, is O(u) time at best. In practice, the algorithm is slower
than uncompressing the text and searching it with a classical algorithm.

— We apply the technique to the LZ78 compression scheme. The result is an
algorithm which turns out to be a practical implementation of the theoretical
proposal of [2]. This algorithm is O(n + R) time in the worst and average
case, and is in practice twice as fast as decompressing and searching.

— We propose a hybrid compression scheme which is between LZ77 and LZ78,
which keeps some of the good features of LZ77 and which can be searched
in O(min(u, n log m) + R) time on average (and O(min(u, mn) + R) in the
worst case). In practice, the compression efficiency is similar to LZ77 and
the search time is similar to LZ78.

In all cases our preprocessing cost is O(σ + m) and our extra space is O(n + R),
almost the same necessary to decompress the text. Our approach is practical and
relies on bit-parallelism. Bit-parallelism is a general technique to take advantage
of the fact that the computer operates in parallel over all the bits of the machine
word, so that if a process is so simple that it can be expressed with bit operations
we can perform many of those steps in a single operation of the processor. If we
call w the length in bits of the machine word (typically 32 or 64), then the
possible speedups are up to O(w). The complexity results presented assume
that m = O(w), otherwise we have to multiply the u and n of our complexities
by m/w.

2 String Matching on Blocks

We now describe a general technique for string matching when the text is
presented as a sequence of atomic strings (here called “blocks”) instead of a
sequence of characters. This technique is the basis for all the different searching
algorithms on Ziv-Lempel compressed text, which are described in the next sections.

Our general assumption is that the blocks either have just one letter (that
we can access directly) or are formed by a concatenation of previously seen
blocks. We describe an online algorithm where we process the text block by
block. At any moment of the search we denote by T' the text already processed (of
|T'| characters). When we finish the search, T' = T, i.e. the original text.

The method works as follows. We process the blocks one by one. For each
new block B, we compute a description for B which has all the information of
the block which is relevant for the search. This description is denoted D(B) =
(L, O, S, P, M), where

— L = |B|, that is, the length of B in characters;

— O = Offs(B) = the length in characters of the text we had processed when
B appeared;

— S = Suff(B) = all the pattern positions¹ which either start a complete
occurrence of B inside the pattern, or start a proper pattern suffix which
matches with a prefix of B. Formally,

Suff(B) = {|x|, P = xBy} ∪ {|x|, |x| > 0 ∧ |z| > 0 ∧ P = xz ∧ B = zy};

— P = Pref(B) = all the pattern positions which either follow a complete
occurrence of B inside the pattern, or follow a proper pattern prefix which
matches with a suffix of B. Formally,

Pref(B) = {|xB|, P = xBy ∧ |y| > 0} ∪
{|z|, |z| > 0 ∧ |y| > 0 ∧ P = zy ∧ B = xz};

— M = Matches(B) = all the block positions where the pattern occurs (∅ if
|B| < |P|). Formally,

Matches(B) = {|x|, B = xPy}.

Figure 1 illustrates these concepts.
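As an illustration only, the set definitions above can be evaluated by brute force when the block is available as an explicit string. The Python sketch below is not the method of the paper, which never materializes the block; the function name `describe` and the string representation of blocks are assumptions made here. Offs(B) is left out because it depends on the state of the search rather than on the block alone.

```python
def describe(block: str, pattern: str):
    """Brute-force (L, Suff, Pref, Matches) of a non-empty block, from the definitions."""
    m, L = len(pattern), len(block)
    suff = set()
    for i in range(m):
        # i starts a complete occurrence of B inside the pattern
        if pattern[i:i + L] == block:
            suff.add(i)
        # i > 0 starts a proper pattern suffix that is a prefix of B
        if i > 0 and block.startswith(pattern[i:]):
            suff.add(i)
    pref = set()
    for j in range(1, m):
        # j follows a complete occurrence of B, with some pattern left after it
        if j >= L and pattern[j - L:j] == block:
            pref.add(j)
        # j follows a proper pattern prefix that is a suffix of B
        if block.endswith(pattern[:j]):
            pref.add(j)
    # block positions where the whole pattern occurs (empty when L < m)
    matches = {x for x in range(L - m + 1) if block[x:x + m] == pattern}
    return L, suff, pref, matches
```

For P = abcab and B = ab this gives Suff(B) = {0, 3}, Pref(B) = {2} and no matches, since the block is shorter than the pattern.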

The description D(B) of a new block B is obtained in one of two ways: (a) the
block is an explicit letter and then we obtain the description directly, or (b) the
block is a concatenation of other blocks previously known, and we obtain its
description by operating on the descriptions of the previous blocks.

Once the description of the new block is computed, we use that description 
to update the state of the search. This concludes the processing of the block and 
we move to the next one. The state of the search contains the matches that have 
already occurred and the potential matches in progress, that is, 

¹ To simplify the notation, we number pattern positions starting at zero.

Fig. 1. Prefixes (P) and suffixes (S) for a long and a short block. The pattern has
the diagonal tiling and the possible blocks have a bar tiling. The suffixes (dotted 
lines) and prefixes (dashed lines) are pattern positions. Prefixes are marked after 
the position where they finish, suffixes are marked at the position they start. 



— Res(T') = the text positions that matched up to now, formally

Res(T') = {|x|, T' = xPy};

— Active(T') = the set of positions following the pattern prefixes which match
a suffix of the current text. Formally,

Active(T') = {|x|, |x| > 0 ∧ |y| > 0 ∧ P = xy ∧ T' = zx}.

Hence, when we complete the text processing and T' is not a text prefix
anymore but the whole text, Res(T) is our answer. The initial state of the search
is T' = ε, and Res(ε) = Active(ε) = ∅.

We have already defined the information we keep, and now consider how to
compute that information. For the formulas that follow, we define some auxiliary
functions, namely

— Left_i(A) = {x − i, x ∈ A} ∪ {m − i, m − i + 1, ..., m − 1}, which receives
a set of Suff() positions not smaller than i, subtracts i from all of them and then
adds new pattern positions filling the hole left by the shift.

— Right_i(A) = {x + i, x ∈ A} ∪ {1, 2, ..., i}, which does the same for
Pref() positions, in the other direction.

— Add_i(A) = {i + x, x ∈ A}, which adds i to all the elements of the set.

— Subtr_i(A) = {i − x, x ∈ A}, which subtracts all the elements of the set
from i.
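The four auxiliary functions translate directly into operations on Python sets. This is only a sketch under our own naming (`left`, `right`, `add`, `subtr`); the pattern length m is passed explicitly because Left needs it to fill the hole left by the shift.

```python
def left(i, A, m):
    """Left_i(A): subtract i from every element, then add positions m-i .. m-1."""
    return {x - i for x in A} | set(range(m - i, m))

def right(i, A):
    """Right_i(A): add i to every element, then add positions 1 .. i."""
    return {x + i for x in A} | set(range(1, i + 1))

def add(i, A):
    """Add_i(A): add i to all the elements of the set."""
    return {i + x for x in A}

def subtr(i, A):
    """Subtr_i(A): subtract every element of the set from i."""
    return {i - x for x in A}
```

Elements smaller than i become negative under `left`; they are harmless in the formulas of the text because the result is always intersected with a set of valid pattern positions.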



2.1 Description of a Letter 

The base case of our scheme is to obtain the description of a block which is a 
letter a. The following is obtained by direct application of the general formulas. 



— |B| = 1

— Offs(B) = |T'|

— Suff(B) = {|x|, P = xay}

— Pref(B) = {|xa|, P = xay ∧ |y| > 0}

— Matches(B) = if P = a then {0} else ∅

2.2 Concatenating Two Blocks

Assume that our block B is defined as the concatenation of one or more previous
blocks. If only one previous block B' is referenced, we just copy its definition. We
now show how to concatenate two blocks, since the case of more than two blocks
is a simple iteration over this procedure. We are given two blocks B1 and B2, and
we have to obtain the description for their concatenation D(B) = D(B1B2) =
D(B1) · D(B2) (where we define · as the concatenation of block descriptions).
The formulas are as follows

— |B| = |B1| + |B2|

— Offs(B) = |T'|

— Suff(B) = Suff(B1) ∩ Left_{|B1|}(Suff(B2))

— Pref(B) = Pref(B2) ∩ Right_{|B2|}(Pref(B1))

— Matches(B) = Matches(B1) ∪ Add_{|B1|}(Matches(B2))
∪ (Subtr_{|B1|}(Pref(B1) ∩ Suff(B2)) ∩ {0, 1, 2, ..., |B| − m})

We now explain the rationale for the formulas (see Figure 2). The first two
are immediate. For Suff(B), note that Suff(B1B2) considers that either a prefix
of B1 may be a suffix of P or B1 may be completely inside P followed by a prefix
of B2 matching a suffix of P. That is, if the number i belongs to Suff(B1B2)
then either

— i ≥ m − |B1|, that is, a prefix of B1B2 is a suffix of P. Notice that in this case
also a prefix of B1 is a suffix of P. Since Left_{|B1|} always adds these positions,
they will appear in the result if and only if they are present in Suff(B1),
which is correct.

— i < m − |B1|, that is, B1 appears inside P and is immediately followed by an
occurrence of B2 (which can be a complete occurrence or share a prefix with
the pattern suffix). If we subtract |B1| from the elements in Suff(B2), then we
are interested in the positions which also appear in Suff(B1) (which since
i < m − |B1| can only correspond to complete occurrences of B1).




Fig. 2. Suffixes of the concatenation of two blocks. It is possible that the result
involves only B1 (rightmost pair) or that it involves both. In this case B1 is
completely inside the pattern and B2 may or may not be totally inside (leftmost
and middle pairs, respectively).

The rationale for Pref() is analogous to Suff(). For Matches(B), there are
three parts. The first one is the matches which are inside B1, and the second one
is the same for B2 (displaced since now B2 comes after B1 in B). The third one
accounts for matches that appear only when B1 and B2 are concatenated. If a
prefix of the pattern is at the end of B1, and the corresponding suffix is at the
beginning of B2, then we have the pattern in B. The Subtr converts pattern
to block positions and the final set which is intersected with the results ensures
that we have really prefixes and suffixes instead of substrings of the blocks.
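To make the concatenation formulas concrete, the following Python sketch builds the description of a block by folding the concatenation over its letters. It works on explicit sets rather than bit masks, and the names `letter`, `concat` and `describe_by_concat` are ours, not the paper's. Offs(B) is omitted since it does not take part in the concatenation.

```python
from functools import reduce

def letter(a, pattern):
    """Description (length, Suff, Pref, Matches) of the one-letter block a."""
    m = len(pattern)
    suff = {i for i in range(m) if pattern[i] == a}
    pref = {i + 1 for i in suff if i + 1 < m}
    matches = {0} if pattern == a else set()
    return 1, suff, pref, matches

def concat(d1, d2, m):
    """Description of B1 B2 computed only from the descriptions of B1 and B2."""
    len1, suff1, pref1, matches1 = d1
    len2, suff2, pref2, matches2 = d2
    length = len1 + len2
    # Suff(B) = Suff(B1) ∩ Left_|B1|(Suff(B2))
    suff = suff1 & ({x - len1 for x in suff2} | set(range(m - len1, m)))
    # Pref(B) = Pref(B2) ∩ Right_|B2|(Pref(B1))
    pref = pref2 & ({x + len2 for x in pref1} | set(range(1, len2 + 1)))
    # matches that only appear across the border between B1 and B2
    cross = {len1 - x for x in pref1 & suff2} & set(range(0, length - m + 1))
    matches = matches1 | {len1 + x for x in matches2} | cross
    return length, suff, pref, matches

def describe_by_concat(block, pattern):
    """Fold concat over the letters of a non-empty block, left to right."""
    m = len(pattern)
    return reduce(lambda d1, d2: concat(d1, d2, m),
                  (letter(a, pattern) for a in block))
```

For the pattern abcab and the block xabcabcab, the fold reports the two occurrences at block positions 1 and 4 without ever comparing the pattern against the block text directly.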



2.3 Updating the Search State 

We now want to update the state of our search by processing a new block B
whose description has just been computed. The formulas to obtain the new
Res(T'B) and Active(T'B) values from the old Res(T') and Active(T') ones are

— Active(T'B) = Right_{|B|}(Active(T')) ∩ Pref(B)

— Res(T'B) = Res(T') ∪ Add_{|T'|}(Matches(B)) ∪
Subtr_{|T'|}(Active(T') ∩ Suff(B) ∩ {m − |B|, m − |B| + 1, ..., m − 1})

The new Active(T'B) value considers that, since a new block B has been
added to T', the pattern prefixes that are suffixes of T'B are those that are
already suffixes of B (i.e. Pref(B)), or those which are suffixes of T' and are
followed by B in the pattern. As before, Right does the trick of considering both
cases in a single formula.

The new value Res(T'B) adds to Res(T') not only the matches which are completely
inside B, but also those which appear when T' is concatenated to B. To
this end, we consider pattern prefixes which are suffixes of T' (i.e. Active(T')),
and which are followed by the corresponding pattern suffix in B. The final
intersection ensures that the complete pattern has appeared. Figure 3 illustrates.
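The update of the search state can be sketched as follows. To keep the example short, the description of each block is computed by brute force from the block text (the helper `describe` is our own and stands in for the concatenation machinery); only the update of Res and Active follows the formulas of this section. Blocks are assumed to be non-empty strings.

```python
def describe(block, pattern):
    """Brute-force (L, Suff, Pref, Matches); a stand-in for the real description."""
    m, L = len(pattern), len(block)
    suff = {i for i in range(m)
            if pattern[i:i + L] == block
            or (i > 0 and block.startswith(pattern[i:]))}
    pref = {j for j in range(1, m)
            if (j >= L and pattern[j - L:j] == block)
            or block.endswith(pattern[:j])}
    matches = {x for x in range(L - m + 1) if block[x:x + m] == pattern}
    return L, suff, pref, matches

def search_blocks(blocks, pattern):
    """Report all text positions where the pattern occurs, block by block."""
    m = len(pattern)
    res, active, t_len = set(), set(), 0      # Res(ε) = Active(ε) = ∅
    for b in blocks:
        L, suff, pref, matches = describe(b, pattern)
        # matches completely inside B, displaced by |T'|
        res |= {t_len + x for x in matches}
        # matches that start in T' and end inside B
        res |= {t_len - j for j in active & suff & set(range(m - L, m))}
        # Active(T'B) = Right_|B|(Active(T')) ∩ Pref(B)
        active = ({j + L for j in active} | set(range(1, L + 1))) & pref
        t_len += L
    return sorted(res)
```

The same occurrences are found however the text is cut into blocks, for instance `search_blocks(["ab", "ca", "bc", "abx"], "abcab")` and `search_blocks(list("abcabcabx"), "abcab")` report the same positions.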






Fig. 3. Updating the state of the search. In the first case we illustrate the up- 
dating of Active(T') (a short block is added). In the second case we show how 
the matches are updated (when a long block is added) . In general both updates 
are necessary. 



3 A Bit-Parallel Implementation 

Until now, we have defined our algorithms in terms of sets of pattern positions.
We now present a well-suited implementation paradigm which allows us to
convert the previous algorithms into efficient implementations.

We use the technique called bit-parallelism PI-. This technique takes advantage
of the fact that the processor works in parallel on all the bits of the computer
word. We call w the number of bits of the computer word, which is 32 or 64 in
current architectures. If one is able to map the elements of a set on bits, and to
express the operations to perform on them by using only the operators provided
by the processor (which are rather limited, i.e. bit shifts, masking, etc.), then
one can effectively parallelize the work on the set, obtaining speedups of up to
O(w) over the original algorithm.

This paradigm was invented in 1989 by Baeza-Yates and Gonnet P] for a
text searching algorithm called Shift-Or. If we consider m ≤ w, then we keep
the state of the search in a computer word D, whose i-th bit tells whether the
prefix of length i of the pattern matches the current text suffix. Initially no
bit signals a match, and a match is reported whenever the m-th bit of D signals a
match. The update formula upon reading a new text character a is

D' = (D << 1) | S[a]

where S[a] is a mask whose i-th bit tells whether P_i = a; we are assuming that
0 represents a match and 1 a mismatch, “|” is the bitwise-or of the computer
word, and “<< ℓ” is a bit shift operation which moves the i-th bit to the
(i + ℓ)-th position, setting the first ℓ bits to zero. Other operations allowed in most
architectures are bitwise-and (&), shift to the other direction (>>), and, which
is more sophisticated, arithmetic operations such as addition and subtraction
which operate on the bit mask as if it were a number.

The Shift-Or algorithm is O(n) provided m ≤ w. If the computer word is
too short to hold one bit per pattern position, then ⌈m/w⌉ computer words are
used for the simulation, and the search takes in the worst case O(mn/w) time.
It is not hard to show that on average it takes O(n), since O(1) computer words
have active states on average.
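A minimal sketch of Shift-Or on plain (uncompressed) text may help to fix the conventions. It follows the update formula D' = (D << 1) | S[a] with 0 meaning match, so D starts with all bits set. Python integers are unbounded, so the restriction m ≤ w is only simulated by masking to m bits; the function name `shift_or` is ours, and the pattern is assumed to be non-empty.

```python
def shift_or(text, pattern):
    """Return the starting positions of all occurrences of pattern in text."""
    m = len(pattern)
    full = (1 << m) - 1                      # m ones
    S = {}
    for i, c in enumerate(pattern):
        # bit i of S[c] is 0 exactly when pattern[i] == c
        S[c] = S.get(c, full) & ~(1 << i)
    D = full                                 # no prefix matches yet
    out = []
    for pos, c in enumerate(text):
        # characters that do not occur in the pattern mismatch everywhere
        D = ((D << 1) | S.get(c, full)) & full
        if D & (1 << (m - 1)) == 0:          # the m-th bit signals a match
            out.append(pos - m + 1)
    return out
```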

Our implementation can indeed be seen as a Shift-Or algorithm working on
blocks instead of letters. The sets Pref(B), Suff(B), and Active(T') are represented
by bit masks. Hence, for blocks of one letter a we have Suff(B) = S[a] and
Pref(B) = (S[a] << 1). The formulas to concatenate blocks are directly translated
by noticing that Left_i and Right_i are converted into “>> i” and “<< i”,
respectively (taking care of the borders which must get active bits), and union
and intersection are converted into “|” and “&”, respectively. Hence, all those
operations on sets are performed in O(1) time if m ≤ w, and O(m/w) time in
general. In practical text searching we can assume m = O(w).
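The translation of the set formulas into bit operations can be sketched as below. For readability this sketch assumes that a 1 bit means "the position belongs to the set", which is the opposite of the 0-means-match convention of Shift-Or; all function names are ours. Only Suff and Pref are computed, by adding one letter at a time to a non-empty block.

```python
def letter_masks(pattern):
    """S[a]: bit i is set iff pattern[i] == a."""
    S = {}
    for i, c in enumerate(pattern):
        S[c] = S.get(c, 0) | (1 << i)
    return S

def left(A, i, m):
    """Left_i as a shift: '>> i', then activate the top i bits (the border)."""
    i = min(i, m)
    full = (1 << m) - 1
    return (A >> i) | (full & ~((1 << (m - i)) - 1))

def right(A, i, m):
    """Right_i as a shift: '<< i', then activate bits 1..i, truncated to m bits."""
    i = min(i, m)
    full = (1 << m) - 1
    return ((A << i) | (((1 << i) - 1) << 1)) & full

def describe_bits(block, pattern):
    """Bit masks (Suff, Pref) of a non-empty block, built letter by letter."""
    m = len(pattern)
    full = (1 << m) - 1
    S = letter_masks(pattern)
    suff = S.get(block[0], 0)
    pref = (suff << 1) & full
    length = 1
    for a in block[1:]:
        s2 = S.get(a, 0)
        p2 = (s2 << 1) & full
        suff = suff & left(s2, length, m)    # Suff(B1) ∩ Left_|B1|(Suff(a))
        pref = p2 & right(pref, 1, m)        # Pref(a) ∩ Right_1(Pref(B1))
        length += 1
    return suff, pref

def positions(mask):
    """Decode a bit mask back into the set of pattern positions."""
    return {i for i in range(mask.bit_length()) if (mask >> i) & 1}
```

With the pattern abcab, `describe_bits("ab", "abcab")` decodes to Suff = {0, 3} and Pref = {2}, each obtained with a constant number of word operations per added letter.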

On the other hand, the sets Res(T') and Matches(B) are explicitly stored
in an array. However, it is not difficult to see that the total amount of work to
handle them is O(R), where R is the number of occurrences of the pattern in
the text. The cost cannot be o(R) if we report all the occurrences.

Hence, if f(n) concatenations are performed along all the process, our total
search cost is O(f(n) + R). The value of f(n) depends on the compression
algorithm. We have also to add a preprocessing cost to build the S[ ] table, which is
O(σ + m).

In all cases, the space complexity of our algorithms is O(n + R), since we
need to store the descriptions of the blocks already seen and the matches found.
Notice that this n refers in fact to the size of the compression window, and the
R to the matches present in that window only.

Finally, we consider the practical problem of uncompressing a neighborhood 
of the occurrences. In practice it is undesirable that we just give the text posi- 
tions matching the pattern. It is much better to uncompress and show a neigh- 
borhood of the match. This neighborhood can be defined as the line holding 
the occurrence, the record (delimited by some given pattern), a fixed number of 
characters, etc. 

Assume that we know a pattern position and want to show a neighborhood.
We just decompress the surrounding blocks forward and backward, until from the
plain text obtained we determine that the neighborhood has been decompressed.
To decompress a block we have two cases: (a) the block is a letter, in which case
we deliver the letter, (b) the block is a concatenation of other blocks, in which
case we decompress each of those blocks in turn. This process takes O(N) time
at most (where N is the size of the decompressed neighborhood), since at each
step we either obtain one character of N or split the final text to be obtained,
and it is not possible to split it more than O(N) times. This shows that it is
practical to show a part of a Ziv-Lempel compressed file without necessarily
uncompressing the whole file.
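A rough sketch of this procedure follows. The data layout is an assumption made for the example: `blocks` maps a block number to either a letter or a list of earlier block numbers, and `sequence` lists the block numbers in text order. For simplicity the sketch expands whole blocks and trims afterwards, whereas the text only needs to expand the part of each block that falls inside the neighborhood.

```python
def expand(block_id, blocks):
    """Plain text of one block, using an explicit stack instead of recursion."""
    out, stack = [], [block_id]
    while stack:
        b = blocks[stack.pop()]
        if isinstance(b, str):        # case (a): the block is a letter
            out.append(b)
        else:                         # case (b): concatenation of other blocks
            stack.extend(reversed(b))
    return "".join(out)

def neighborhood(sequence, blocks, k, radius):
    """Text of block number k in the sequence, with up to `radius`
    characters of context on each side."""
    left, i = "", k - 1
    while i >= 0 and len(left) < radius:          # go backward
        left = expand(sequence[i], blocks) + left
        i -= 1
    right, j = "", k + 1
    while j < len(sequence) and len(right) < radius:   # go forward
        right += expand(sequence[j], blocks)
        j += 1
    keep = min(radius, len(left))
    return left[len(left) - keep:], expand(sequence[k], blocks), right[:radius]
```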



4 LZ78 Compression 

4.1 Compression Algorithm 

The Ziv-Lempel compression algorithm of 1978 (usually named LZ78 jS2j) is
based on a dictionary of blocks, to which we add every new block computed.
At the beginning of the compression, the dictionary contains a single block b_0
of length 0. The current step of the compression is as follows: if we assume
that a prefix t_1 ... t_i of T has been already compressed in a sequence of blocks
Z = b_1 ... b_c, all of them in the dictionary, then we look for the longest prefix
of the rest of the text t_{i+1} ... t_u which is a block of the dictionary. Once we
have found this block, say b_k of length l_k, we construct a new block
b_{c+1} = (k, t_{i+l_k+1}), we write the pair at the end of the compressed file Z,
i.e. Z = b_1 ... b_c b_{c+1}, and we add
the block to the dictionary. It is easy to see that this dictionary is prefix-closed
(i.e. any prefix of an element is also an element of the dictionary) and a natural
way to represent it is a trie.

We give as an example the compression of the word ananas in Figure 4. The
first block is (0,a), and next (0,n). When we read the next a, a is already
block 1 in the dictionary, but an is not in the dictionary. So we create a third
block (1,n). We then read the next a, a is already block 1 in the dictionary,
but as does not appear. So we create a new block (1,s).
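The compression step just described can be sketched in a few lines of Python. The trie is stored as a dictionary keyed by (parent block, letter), which is one possible representation. How to flush a final phrase that is already in the dictionary is not specified in the text; emitting a pair with an empty letter is an assumption of this sketch.

```python
def lz78_compress(text):
    """Return the list of pairs (k, letter) produced by LZ78."""
    trie = {}                 # (parent block number, letter) -> block number
    pairs, k = [], 0          # k = block matched so far (0 is the empty block)
    for c in text:
        if (k, c) in trie:
            k = trie[(k, c)]              # keep extending the current prefix
        else:
            pairs.append((k, c))          # new block = block k plus letter c
            trie[(k, c)] = len(pairs)
            k = 0
    if k:
        pairs.append((k, ""))             # assumption: flush a pending phrase
    return pairs

def lz78_decompress(pairs):
    """Rebuild the text; an array of blocks replaces the trie."""
    blocks = [""]
    for k, c in pairs:
        blocks.append(blocks[k] + c)
    return "".join(blocks[1:])
```

On the word ananas the compressor produces (0,a)(0,n)(1,n)(1,s), the same pairs as in the example of the text.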

The compression algorithm is O(u) in the worst case and efficient in practice
if the dictionary is stored as a trie, which allows rapid searching of the new text

Prefix encoded    Compressed file

a                 (0,a)
n                 (0,a)(0,n)
an                (0,a)(0,n)(1,n)
as                (0,a)(0,n)(1,n)(1,s)



Fig. 4. Compression of the word ananas with the algorithm LZ78. 



prefix (for each character of T we move once in the trie). The decompression 
needs to build the same dictionary (the pair that defines the block c is read 
at the c-th step of the algorithm), although this time it is not convenient to 
have a trie, and an array implementation is preferable. Compared to LZ77, the 
compression is rather fast but decompression is slow. LZ78 is used by Unix’s 
Compress program. 

Many variations on LZ78 exist, which deal basically with the best way to code 
the pairs in the compressed file, or with the best way to update the window. A 
particularly interesting variant is from Welch, called LZW EH! In this case, the 
extra letter (second element of the pair) is not coded, but it is taken as the first 
letter of the next block (the dictionary is started with one block per letter). A 
variant over this is presented by Miller and Wegman (which we call LZMW), 
where the new block is not the previous one plus the first letter of the new one, 
but simply the concatenation of the previous and the new one. 

4.2 Pattern Matching in LZ78 Compressed Files 

Our general algorithm for searching in a sequence of blocks Z = b_1 ... b_n can
be directly applied if we consider the new letter added after each block created
by the LZ78 compression algorithm as a separate block. That is, each new pair
(k, a) read at step c is taken as a reference to a previous block (b_k) followed by
a literal block (a). Hence, we compute the description of the concatenation of
b_k and a and add it as the new block b_c to our dictionary. At the same time,
we update the state of the search using the description of b_c just computed. Of
course, in practice we manage this one-letter block in a special way, to speed up
the block concatenation. We keep all the descriptions of the blocks b_k in an array
which is directly accessed.
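Putting the pieces together, a search over LZ78 pairs can be sketched as follows. The sketch uses explicit Python sets instead of bit masks, stores one description per block in a list, and assumes every pair carries a real letter; the function names are ours. It is meant to illustrate the flow of the algorithm, not its efficient implementation.

```python
def letter(a, pattern):
    """Description (length, Suff, Pref, Matches) of the one-letter block a."""
    m = len(pattern)
    suff = {i for i in range(m) if pattern[i] == a}
    pref = {i + 1 for i in suff if i + 1 < m}
    return 1, suff, pref, ({0} if pattern == a else set())

def concat(d1, d2, m):
    """Description of B1 B2 from the descriptions of B1 and B2."""
    len1, suff1, pref1, matches1 = d1
    len2, suff2, pref2, matches2 = d2
    length = len1 + len2
    suff = suff1 & ({x - len1 for x in suff2} | set(range(m - len1, m)))
    pref = pref2 & ({x + len2 for x in pref1} | set(range(1, len2 + 1)))
    cross = {len1 - x for x in pref1 & suff2} & set(range(0, length - m + 1))
    return length, suff, pref, matches1 | {len1 + x for x in matches2} | cross

def lz78_search(pairs, pattern):
    """All text positions where pattern occurs, reading only the LZ78 pairs."""
    m = len(pattern)
    descs = [None]                        # descs[c] = description of block c
    res, active, t_len = set(), set(), 0
    for k, a in pairs:
        d = letter(a, pattern)
        if k != 0:                        # block = b_k followed by the letter a
            d = concat(descs[k], d, m)
        descs.append(d)
        L, suff, pref, matches = d
        res |= {t_len + x for x in matches}
        res |= {t_len - j for j in active & suff & set(range(m - L, m))}
        active = ({j + L for j in active} | set(range(1, L + 1))) & pref
        t_len += L
    return sorted(res)
```

For the compressed form of ananas, (0,a)(0,n)(1,n)(1,s), searching for na reports the text positions 1 and 3.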

The algorithm we obtain is much the same as in [2]. The main differences are
that we obtain this algorithm as a particular case of a general string search
algorithm for text that comes in blocks, that their algorithm is originally designed
for LZW compression, and that we search all the occurrences of the pattern, not
only the first one. Moreover, we present a practical implementation based on
bit-parallelism, while [2] is a theoretical work that has not been implemented.
To our knowledge ours is the first real implementation of this algorithm². It is
quite easy to adapt our algorithm to work on other variants of LZ78, such as
LZW or LZMW. In particular we can easily adapt to different window management
policies. The simplest one is that when the compressor memory is full,
the dictionary is deleted and compression is restarted. Others try to remove the
least interesting blocks from the dictionary, e.g. m-. Our searcher can follow
the same steps of the compressor along the search, using the same amount of
memory.

4.3 Analysis 

The theoretical complexity of the pattern matching algorithm is O(n + R) (recall
that, as we use bit-parallelism, we have O(mn/w + R) time for long patterns).
If n = o(u), this is faster than searching in the uncompressed text. In practical
terms, the algorithm is rather efficient since no extra work apart from one block
concatenation and one update of the search is performed per element of the
compressed file.

Our experimental results, however (Section 0), show that the algorithm takes
in practice twice the time of a Shift-Or run on the uncompressed text. This is
because Shift-Or is very simple, and although we process many characters of the
uncompressed text in one shot, in practice the cost of each step is big enough to
offset any possible gain due to compression. A specific problem is the locality
of reference: the compressed matching algorithm reads random positions in the
array of block definitions, while the uncompressed algorithm works basically
in place. The caching mechanism of the computer largely favors this last approach.

However, there is a positive result. Searching the compressed file with this 
algorithm is twice as fast as decompressing it and then searching the uncom- 
pressed file. For this comparison we are assuming that the file is compressed 
with LZ77 (which is much faster than LZ78 to decompress) and consider the 
time of gunzip, which is an optimized decompression software. Hence, if the text 
collection is kept compressed (which is definitely of interest) then it is much 
faster to search directly the compressed files. 

We have tried to further improve our algorithm. For instance, we have created 
a variant called Mark-LZ78. In this compression algorithm, we mark with a bit 
flag for each block if the block is a leaf of the dictionary trie or not, to avoid 
storing the block description if this block is not used anymore. However, as we 
show in the experiments, the performance does not improve. 

5 LZ77 Compression 

5.1 Compression Algorithm 

The Ziv-Lempel compression algorithm of 1977 (usually named LZ77 P^) is,
in some sense, simpler than LZ78, since the basic idea is just to recognize two
repeated segments of the text and to mark the second as a reference (position in
the text and length of the repeated part) to the first one. More formally, assume
that a prefix t_1 ... t_i of T has been already compressed in a sequence of blocks
Z = b_1 ... b_c. We look for the longest prefix v of t_{i+1} ... t_u which already
appears in the text starting at an earlier position. Once we have it, say that we
find it starting at position j ≤ i, we add a new block (j, |v|) to the compressed
file Z. A special case occurs if v is empty, in which case t_{i+1} is a new letter
and we code it with a special block (0, t_{i+1}). With the same example ananas,
we obtain: (0,a) nanas; (0,a)(0,n) anas; (0,a)(0,n)(1,3) s; (0,a)(0,n)(1,3)(0,s).

² See, however, m, in this very same conference.

Notice that the above definition allows the referenced block to overlap
the one which is being compressed. Another variant avoids this for simplicity,
i.e. v must be found entirely inside t_1 ... t_i. In this case the compression of
ananas becomes: (0,a) nanas; (0,a)(0,n) anas; (0,a)(0,n)(1,2) as;
(0,a)(0,n)(1,2)(1,1) s; (0,a)(0,n)(1,2)(1,1)(0,s).
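Both variants can be reproduced with a naive compressor. The sketch below searches every earlier starting position by brute force, so it is quadratic or worse and only meant to make the definition concrete. Ties are resolved in favor of the leftmost occurrence, which is an assumption chosen because it reproduces the two examples above; positions are 1-based as in the text, and the flag name `allow_overlap` is ours.

```python
def lz77_compress(text, allow_overlap=True):
    """Return blocks (j, length) or (0, letter) for a new letter."""
    blocks, i, u = [], 0, len(text)       # i = characters already compressed
    while i < u:
        best_len, best_j = 0, 0
        for j in range(i):                # candidate start of the earlier copy
            length = 0
            while (i + length < u
                   and (allow_overlap or j + length < i)
                   and text[j + length] == text[i + length]):
                length += 1
            if length > best_len:         # strict: keep the leftmost candidate
                best_len, best_j = length, j
        if best_len == 0:
            blocks.append((0, text[i]))   # special block for a new letter
            i += 1
        else:
            blocks.append((best_j + 1, best_len))
            i += best_len
    return blocks
```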

Yet another variant codes the repeated block and then the letter which follows 
it in the still uncompressed text. There are many other variants as well, mainly 
related to how to represent the pairs in the compressed file and how to compress 
fast. In general, the position j is coded as the difference i + 1 − j, since the last
occurrence of the block is used and v is normally restricted to not appear too
far away from t_i.

LZ77 compresses more than LZ78, both in theory and in practice. From a
theoretical point of view, the variant which allows overlaps can obtain a compressed
file of O(1) blocks in the best case, while the one not allowing overlaps obtains
at best O(log u). LZ78, on the other hand, cannot obtain fewer than O(√u). This
is easily seen by considering the best-case file T = a^u. In practice it is also true
that LZ77 compresses more than LZ78. LZ77 is implemented in the GNU gzip
program.

Compression is rather slow with LZ77. It is expensive in time and space to
find the longest prefix of the uncompressed part of the file that appears already
in the compressed part. In theory, the compression is O(u) in time and space by
the use of a suffix tree or a DAWG automaton ISHEg. In practice, the search is
done in a buffer window and a large hash table is normally used, as in gzip. An
experimental comparison of different techniques to find the prefix can be found
in p]. The decompression algorithm, on the other hand, is very fast (faster than
for LZ78) because to decompress a block it is just necessary to copy a part of
the text and no dictionary has to be kept.
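The simplicity of decompression is easy to see in code. This sketch assumes the block format used in the examples of this section, i.e. (0, letter) for a new letter and (j, length) with a 1-based text position j otherwise. Copying one character at a time makes references that overlap the block being produced work without any special handling.

```python
def lz77_decompress(blocks):
    """Rebuild the text from blocks (0, letter) or (j, length)."""
    out = []
    for pos, x in blocks:
        if pos == 0:
            out.append(x)                     # literal letter
        else:
            for t in range(x):                # copy, possibly overlapping
                out.append(out[pos - 1 + t])
    return "".join(out)
```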

5.2 Pattern Matching in LZ77 Compressed Files 

Our algorithm for LZ77 is an adaptation of the general algorithm on blocks, with
one main difference. On LZ77 compressed files, when we want to process a new
block, the situation shown in Figure 5 generally occurs: the new block references
a sequence of r contiguous previously processed blocks, but it overlaps with the
first and last one (u and v in the figure). That is, the new block does not
exactly correspond to previously processed blocks. Therefore, we do not have all
the information on the blocks u and v that we need to concatenate the blocks.

We solve this by computing recursively the descriptions of the two blocks
u and v with the same method. That is, we simulate that we are back in the
text, where those blocks appeared, and compute their description (this may
trigger more recursive invocations with the same purpose). When we finally
obtain the descriptions of u and v, we concatenate all the referenced blocks to
obtain the description of the new block. Another possibility is that the new block
is completely inside another block already processed, in which case we have to
recursively consider the blocks that define the referenced block.







Fig. 5. Recursive computation of the description of a block in LZ77 compressed 
files. 

We now explain a technique to concatenate the r blocks in low average time.
Instead of computing Pref(B) and Suff(B) of the first block, then concatenating
with the second, then with the third, until the r blocks are concatenated, we
compute Suff(B) from the first block to the r-th and Pref(B) from the r-th
block to the first one. We analyze this shortly.



5.3 Analysis and Improvements 

We analyze now the many aspects of our algorithm and propose some improve- 
ments. 

Block concatenation. If we use the proposed block concatenation technique, we
have that in the worst case only the first m blocks can affect Suff(B) and only
the last m blocks can affect Pref(B), so the worst case time for concatenating
the blocks becomes O(min(u, mn)).

We show now that on average only O(log m) blocks are processed until
Suff(B) becomes stable. Each new block character we process will either extend
the current suffixes of the set Suff(B) or make them disappear from the
set. Each suffix is removed from the set with probability 1 − 1/σ (i.e. if the new
block character cannot extend it). Before we read the block characters all the m
pattern positions are in Suff(B), and therefore on average no pattern positions
remain in the set after O(log m) block characters are read (after the i-th character
is read, the pattern positions m − i to m − 1 cannot be removed from the
set, but their situation cannot change anyway).

Even if we consider all blocks of length 1 (the worst), we work on average
O(n log m) because of concatenations. The same reasoning holds for Pref(B).

The only part of the block concatenation which cannot skip blocks is the
computation of Matches(B). However, this adds up O(R) time along all the
search. Therefore, the total time for block concatenation is O(min(u, n log m) +
R) on average.

Finding the blocks. We consider now how to find the indices of the blocks that
define a text position j. We keep an array with the blocks already seen. Binary
searching the text position among these blocks adds $O(n \log n)$ to the cost. Instead,
we keep a table of $O(n)$ entries where the element i points to the block
where the text position $\lfloor iu/n \rfloor$ is defined. By accessing this table we directly
arrive at the correct block with an average inaccuracy of $O(u/n)$, and a final
binary search finds the correct position, for a total cost of $O(n \log(u/n))$
(in practice a linear search is faster for the final part). This gives good results
in practice. Another alternative is that the compressor does not store the
text position and length of the repeated part, but instead it gives the block
numbers involved and the offsets inside u and v. Since a text position needs
$O(\log u)$ bits and a block number plus an offset inside the block needs on average
$\lceil \log_2 n \rceil + \lceil \log_2 (u/n) \rceil = O(\log u)$ bits, the order of the compression ratio should
not worsen. We show in the experiments that this version of the algorithm (called
Block-LZ77) is faster than the plain version, since no searching of the text position
is necessary. However, compression ratios worsen significantly in practice
due to round-offs.
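The table lookup described above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: positions and block numbers are 0-based, the hypothetical list `starts` holds the first text position of every block, and the final step uses the linear scan that the text reports as faster in practice.

```python
from bisect import bisect_right


def build_table(starts, u):
    """Entry i points to the block that contains text position floor(i*u/n).

    starts: sorted first positions of the n blocks (starts[0] == 0), assumed layout.
    u: length of the uncompressed text.
    """
    n = len(starts)
    return [bisect_right(starts, (i * u) // n) - 1 for i in range(n)]


def find_block(starts, table, u, j):
    """Return the index of the block that contains text position j (0 <= j < u)."""
    n = len(starts)
    # The table entry refers to a position <= j, so its block is never past the answer.
    k = table[(j * n) // u]
    # Final linear scan over the (on average short) remaining distance.
    while k + 1 < n and starts[k + 1] <= j:
        k += 1
    return k


# Blocks of "ananas" split as a | n | an | a | s start at positions 0, 1, 2, 4, 5.
starts = [0, 1, 2, 4, 5]
table = build_table(starts, 6)
blocks_of_positions = [find_block(starts, table, 6, j) for j in range(6)]
```

Here the table is built with a binary search per entry for brevity; a search algorithm that reads the blocks sequentially could fill it incrementally instead.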

Computing partial blocks. However, the really costly part of the algorithm is
not here, but in the recursive computation of the partial blocks u and v. If we
consider that each time we perform a recursive call we “split” the original block
B at a new position, then it is clear that at most $|B|$ recursive calls can be
made until we have split it into single characters and therefore have found the
definition of each one. This shows that the total cost of the recursive calls is
$O(u)$ in the worst case. Our experiments suggest that this is also the average
case, but we were not able to prove it.

Consider now the cost of the recursive invocations in the case where the
new block B is strictly inside its referencing block. For instance, a letter which
repeats inside a large block could trigger a long chain of recursive invocations
until its real definition is found. In the worst case, we could have a block of
size s which references one of size $s - 1$, and this one references another of size
$s - 2$, and so on. We would work $O(s)$, but the size of the text at that point
is $O(s^2)$. Hence, at text position i we cannot work more than $\sqrt{i}$, which gives
a total worst-case cost of $O(n\sqrt{u})$, which is too high. This problem does not
disappear if the compressor always stores the first occurrence of the repeated
block instead of the last one, because we may not point to the first occurrence
when we consider partial blocks.

Hence the total amount of work is $\omega(u)$ in the worst case whenever
$n = \omega(\sqrt{u})$, and we conjecture that this is also the average case. See the left plot of
Figure 6, where we have experimented with the English text described in Section 7.
Least squares fitting shows that a good model for the number of recursive
invocations per text character is $0.177 + 0.1 \ln u$ (with less than 0.5% error in
the approximation). The experiment suggests that the algorithm is $O(u \log u)$
on average. This is, unfortunately, worse than uncompressing and searching. We
present now some techniques to improve this situation.

Improvements. A first improvement we tried consisted of storing more information
than simply one description per block. For instance, when we compute the
descriptions of the partial blocks u and v (which are not part of the original
sequence of blocks), we could store them instead of discarding them. If another
block later needs the descriptions of u and v, we have already computed them.
Figure 6 (right plot) shows that the total number of recursive calls is reduced using
this technique, and we conjecture that in this case we work $O(u)$ (least squares
fitting yields a complexity of 0 (m‘’ ®®®^^)). These blocks, however, cannot be eas- 
ily stored in the array of blocks since they do not belong to the sequence. A 
hashing implementation gave bad results in practice: the cost of adding the
new blocks outweighed the gains of having them already computed. This could
change for longer texts, if the orders of the two algorithms are different.





Fig. 6. Number of recursive invocations (thick line) and block concatenations 
(thin line) per text character, for natural language text. The left plot shows 
the basic algorithm and the right plot shows the improvement of adding the 
computed blocks. 

Another improvement, which gave good practical results, was to try to com- 
pute less (instead of more) information. Our aim was to avoid the recursive com- 
putation of u and v. Hence, instead of computing their descriptions recursively, 
we pessimistically assume that they match all the pattern positions. If they are 
short enough we will not have a match even assuming this, and we could pro- 
cess them without actually obtaining their descriptions. Only when we find a 
(possible) match we backtrack to the point where it could have been started and 
compute correctly the involved blocks. For each block, we store whether it has 
been correctly or pessimistically computed. As we show in the experiments, this 
improves search time for patterns of length 15 or more in practice. However, the
method is limited since we cannot skip more than m characters of T without
having at least one character correctly computed; hence in the very best case we
pay $O(u/m)$ with this speedup. We call this algorithm Skip-LZ77 (combined
with Block-LZ77, it yields Skip-Block-LZ77).

Final remarks. Even with all these improvements, the experiments show that 
this algorithm is much slower than decompressing (with gunzip) and search- 
ing (with Shift-Or). Although ours is the first algorithm to directly search in 
LZ77-compressed text, we believe that it is not possible in practice to beat a 
decompress-then-search approach. The root of this limitation lies in the need to 
recursively compute u and v. Another consequence of the existence of partial 
blocks is that, even if the compressor uses a window of fixed size to select the 
strings to repeat, we need to keep in memory all the previous blocks, since even if 
they are not directly referenced anymore, we may need to resort to them in case 
of partial blocks. We propose in the next section a slightly different compression 
scheme which gets rid of all the aspects of LZ77 compression that degrade the 
searching performance. 

We finish this section with a few comments. First, as is clear from the
algorithm, we do not handle the case of overlapping compression, i.e. when the
referenced block can overlap with the new block B. Although we could handle
it, the result is the same in cost as if the compressor avoided such overlapping
(i.e. performing many steps, where a step ends when an overlap occurs). Second,
other variants of LZ77 are easily accommodated. Finally, we notice that a neighborhood
of size N around the occurrences can be obtained using the general
mechanism at $O(N\sqrt{u})$ cost (or, according to the empirical results, $O(N \log u)$
cost). This is because of the cost to find the definitions of the incomplete blocks.

6 A New Hybrid Compression Algorithm 

It became clear in the previous section that the worst part of the cost of the 
LZ77 search algorithm was due to the cost of recursively computing partial 
blocks, and of finding the block corresponding to a text position. We design 
a new compression algorithm between LZ78 and LZ77, to have multiple-block 
compression (not just one block like in LZ78), but also to avoid the recursive 
situation which appears in searching LZ77-compressed files (Figure [^. 

We propose the following algorithm. Assume that a prefix $t_1 \ldots t_i$ of T has
already been compressed into a sequence of blocks $Z = b_1 \ldots b_c$. We look now for
the longest prefix v of $t_{i+1} \ldots$ which is represented by a sequence $b_r \ldots b_{r+h}$ already
present in the compressed file. If there are many alternative choices for the
same v, we take the one with the minimum number of blocks (to reduce the cost of
concatenations). If several possibilities still remain, we take the first occurrence
(the one with the minimum number for the first block). We code this new block by (r, h).
As in LZ77, if v is empty (i.e. the letter is new), we code a special block
$(0, t_{i+1})$. With the same example ananas, we obtain: (0,a) nanas; (0,a)(0,n)
anas; (0,a)(0,n)(1,1) as; (0,a)(0,n)(1,1)(1,0) s; (0,a)(0,n)(1,1)(1,0)(0,s).
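The parsing rule above can be illustrated with a deliberately naive sketch. It is an assumption-laden illustration rather than the authors' compressor: it tries every starting block instead of using a sparse suffix trie, so it does not run in linear time, and the function names are hypothetical. Blocks are numbered from 1, and a new letter c is coded as (0, c).

```python
def hybrid_compress(text):
    """Parse text into (r, h) pairs and (0, letter) pairs, following the rule above."""
    blocks = []  # blocks[k - 1] is the expanded string of block b_k
    codes = []
    i = 0
    while i < len(text):
        best = None  # (ranking key, r, h) of the best candidate found so far
        for r in range(len(blocks)):
            length = 0
            for h in range(len(blocks) - r):
                piece = blocks[r + h]
                if not text.startswith(piece, i + length):
                    break
                length += len(piece)
                # Prefer longer v, then fewer blocks, then the earliest first block.
                key = (length, -(h + 1), -(r + 1))
                if best is None or key > best[0]:
                    best = (key, r + 1, h)
        if best is None:
            codes.append((0, text[i]))
            blocks.append(text[i])
            i += 1
        else:
            key, r, h = best
            codes.append((r, h))
            blocks.append(text[i:i + key[0]])
            i += key[0]
    return codes


def hybrid_decompress(codes):
    """Rebuild the text; every block is a letter or a run of earlier blocks."""
    blocks = []
    for first, second in codes:
        if first == 0:
            blocks.append(second)
        else:
            blocks.append("".join(blocks[first - 1:first + second]))
    return "".join(blocks)


ananas_codes = hybrid_compress("ananas")
```

On ananas this sketch yields the same five blocks as the example in the text. Note that the decompressor only ever concatenates whole earlier blocks, which is the property that removes partial blocks from the search.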
The main advantage of this compression scheme is that it avoids the recursive
case in the LZ77 pattern matching (Figure 0, because we know already that the 
new block corresponds directly to a concatenation of already processed blocks. 
Moreover, we do not need to search the text position in the blocks, since we can 
directly access the relevant blocks. 

The compression can still be performed in $O(u)$ time by using a sparse suffix
trie where only the block beginnings are inserted and, when we fall out
of the trie, we take the last node visited which corresponds to a block ending.
Decompression is slower than for LZ77, since we need to keep track of the blocks
already seen to be able to retrieve the appropriate text. Finally, the compression
ratio is in principle worse than for LZ77 since we are limited in the text segments
that we can use. On the other hand, the numbers to code are smaller since we
code block positions in $O(\log n)$ bits instead of text positions in $O(\log u)$ bits.
Moreover, if we use a simple trick, the compression is in general better than for
LZ78 since we are not limited to using just one block. The trick is to represent
the pairs (r, 0) as (2r), and the pairs (r, h+1) as (2r+1, h). This pays off because
the second element of the pair is frequently zero.
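The pair trick can be written down in a few lines. The sketch below is illustrative only: the names `pack` and `unpack` are hypothetical, codes are modeled as tuples, and only pairs with r >= 1 are handled, since the text does not say how the special blocks for new letters are combined with the trick.

```python
def pack(r, h):
    """Map (r, 0) to the single number (2r,) and (r, h + 1) to (2r + 1, h)."""
    return (2 * r,) if h == 0 else (2 * r + 1, h - 1)


def unpack(code):
    """Inverse of pack: the parity of the first number tells whether h is present."""
    first = code[0]
    if first % 2 == 0:
        return (first // 2, 0)
    return (first // 2, code[1] + 1)


single_block = pack(3, 0)   # one number instead of two
three_blocks = pack(3, 2)
```

The saving comes from the frequent case h = 0, where the second number disappears at the price of one extra bit in the first.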

The searching algorithm is like that of LZ77, except that we do not need
to search for the blocks and we do not have to recursively find the partial blocks
u and v (they simply do not exist now). From the analysis of the LZ77 pattern
matching algorithm we have that we work $O(\min(u, n \log m) + R)$ on average
and $O(\min(u, mn) + R)$ in the worst case (thanks to the improved algorithm
to concatenate blocks). In practice, the performance of this algorithm is very close to
that of LZ78 pattern matching. We also tried a marked version (called Mark-Hybrid)
where for each block a bit is stored which tells whether or not the block will be
used again, but as for LZ78, the search time does not improve in practice.

Unlike LZ77, we can use less memory if the compressor restricts the references 
to a window of the text. Since there are no recursive references, those blocks 
which are far away in the past need not be stored since they will not be referenced 
anymore. Hence, as in LZ78, we need the same memory as the compressor. A 
window of size N can be displayed in $O(N)$ time.



7 Experimental Results 

We show in this section our empirical results on the behavior of our search and
compression schemes. We first study the compression techniques and later the
search performance.

We use mainly two files for the experiments. One is an English literary text
(from B. Franklin) of 1.29 Mb, filtered to lower-case and with separators normalized.
The other is the DNA chain of “h. influenzae”, of 1.36 Mb. For comparative
purposes, we also show the results on some files of the Calgary Corpus
(ftp://ftp.cpsc.ucalgary.ca/pub/projects/text.compression.corpus/): two
books (book*), six troff-formatted scientific articles (paper*) and three source
program codes (prog*).
7.1 Compression Performance

It is interesting to study the compression performance of the algorithms for two 
reasons: first, we propose a hybrid compression scheme which we have to evaluate 
in terms of compression ratios. Second, our search algorithms use a technique 
to code the pairs which speeds up search time but which is suboptimal: the 
numbers are stored in as many bytes as needed (using the highest bit to denote 
if there are more bytes or not). 
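A byte-oriented number code of the kind just described might look as follows. This is a hedged sketch, not the authors' format: the text only states that the highest bit of each byte signals whether more bytes follow, so the choice of seven payload bits per byte with the least significant group first is an assumption, and the function names are hypothetical.

```python
def encode_number(x):
    """Encode a non-negative integer in as many bytes as needed.

    Each byte carries 7 bits of the number; its highest bit is set when
    more bytes follow (assumed order: least significant group first).
    """
    out = bytearray()
    while True:
        low = x & 0x7F
        x >>= 7
        if x:
            out.append(low | 0x80)  # more bytes follow
        else:
            out.append(low)         # last byte
            return bytes(out)


def decode_number(data, pos=0):
    """Decode one number starting at data[pos]; return (value, next position)."""
    value = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            return value, pos


small = encode_number(127)
larger = encode_number(128)
```

Decoding needs only shifts and masks on whole bytes, which is why such a code is convenient for searching even though it wastes space compared with bit-level codes.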

We first compare the number of bits needed to code a file with our hybrid
compression scheme against the same number for LZ77 and LZ78. We call this
approach “bit-coding”. This is aimed at giving an idea of the expected compression
performance when the file is compressed with a real technique (such as Elias
or Huffman codes). Many other improvements are possible. A deeper study
of the best techniques for our hybrid compressor is deferred to future work.

Table 1 shows the results. The “Ideal” column counts exactly the bits used
by each number stored in the compressed file, while both “Elias” columns count
the number of bits needed to represent the numbers using these codes. The
letters, on the other hand, are Huffman coded. For English and DNA we show
in a second line the percentages for different variants of the compressors: Block-LZ77,
Mark-LZ78 and Mark-Hybrid, respectively. With our Hybrid compression
method, we obtain estimated compression ratios comparable to LZ77. The Hybrid
and LZ77 compression is better than LZ78 except for DNA, where only two
bits are necessary to code a letter. Block-LZ77, on the other hand, compresses
quite badly.

We now perform a practical comparison using our byte-coding techniques
against good LZ77 and LZ78 compressors, namely gzip and Compress respectively.
This is to show how much compression we are losing in order to ease the
searching process.

Table 2 shows the compression ratios achieved. The percentages in the second
row of English and DNA have the same meaning as before. Interestingly,
Compress is better than gzip on DNA, which rarely happens on natural language 
texts. Our compression ratios show a penalty with respect to those of gzip. Our 
byte compression method is very simple, and these results show in which pro- 
portion our compression ratios could be improved by engineering techniques, 
keeping in mind that complicating the encoding of the numbers risks slowing 
down the pattern matching process. 

7.2 Search Algorithms 

We compare now the search time for our algorithms against the decompressing 
and searching approach. The experiments were run on a Sun UltraSparc-1 of 167 
MHz, with 64 Mb of RAM, running Solaris 2.5.1. We consider user time, which 
is within 2% of accuracy with 95% confidence. Time is expressed in seconds 
everywhere in this section. 

Recall that Elias-γ precedes the number x by its length in unary, while Elias-δ uses
Elias-γ to code the length that precedes the number.
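For readers unfamiliar with these codes, the following sketch produces them as bit strings. It uses the common textbook variant, which is an assumption here: the unary length prefix shares the leading 1 bit of the number, so a number of n bits takes 2n - 1 bits in Elias-γ and n + 2⌊log2 n⌋ bits in Elias-δ, close to but not identical with the approximate counts used in the tables.

```python
def elias_gamma(x):
    """Elias-gamma code of x >= 1: (n - 1) zeros, then the n binary digits of x."""
    binary = bin(x)[2:]
    return "0" * (len(binary) - 1) + binary


def elias_delta(x):
    """Elias-delta code of x >= 1: gamma code of the length n, then x without its leading 1."""
    binary = bin(x)[2:]
    return elias_gamma(len(binary)) + binary[1:]


gamma_of_5 = elias_gamma(5)
delta_of_5 = elias_delta(5)
```

The δ code pays off for large numbers, because the length n itself is coded in about 2 log2 n bits instead of n bits in unary.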
File     Size (Kb)  Ideal                     Elias-γ                   Elias-δ
                    LZ77    LZ78    Hybrid    LZ77    LZ78    Hybrid    LZ77    LZ78    Hybrid
English  1,324      29.67%  36.15%  29.28%    59.34%  64.01%  58.57%    48.96%  52.04%  46.17%
                    52.45%  38.01%  31.24%    104.9%  82.31%  62.48%    74.25%  54.71%  48.75%
DNA      1,390      28.03%  25.30%  29.08%    56.06%  47.33%  58.18%    45.77%  37.71%  46.40%
                    47.21%  26.77%  31.15%    94.43%  67.62%  62.30%    73.14%  39.91%  49.03%
book1    751        34.10%  40.70%  35.62%    68.20%  70.83%  71.25%    41.26%  44.96%  41.50%
book2    597        29.33%  40.21%  30.44%    58.66%  69.46%  60.89%    35.51%  44.41%  35.72%
paper1   52         32.33%  46.20%  34.29%    64.53%  77.01%  68.59%    41.05%  51.92%  41.91%
paper2   80         32.68%  43.00%  34.80%    65.27%  72.84%  69.60%    41.08%  48.28%  42.01%
paper3   45         35.10%  45.50%  38.12%    70.07%  76.23%  76.24%    44.84%  51.36%  46.55%
paper4   13         37.60%  47.95%  41.07%    74.74%  78.30%  82.15%    49.92%  54.81%  51.55%
paper5   12         39.85%  50.79%  41.74%    79.13%  82.42%  83.49%    52.63%  57.92%  52.39%
paper6   37         33.60%  47.72%  35.69%    67.03%  79.08%  71.38%    42.91%  53.72%  43.81%
progc    39         32.21%  47.99%  34.16%    64.24%  79.14%  68.32%    41.24%  53.96%  41.95%
progl    70         22.45%  39.10%  23.30%    44.82%  65.83%  44.92%    28.04%  43.85%  27.65%
progp    48         21.34%  40.36%  22.46%    42.54%  66.95%  46.60%    27.16%  45.33%  28.46%



Table 1. Estimated compression ratios with three different methods. For each
number in the compressed file, if n denotes the number of bits needed to code it, then Ideal
counts only n, Elias-γ counts 2n and Elias-δ counts $n + 2\lceil \log_2 n \rceil$. The second
line (in italics) of English and DNA corresponds to Block-LZ77, Mark-LZ78 and
Mark-Hybrid, respectively.



File     gzip    Compress  Byte-LZ77  Byte-LZ78  Byte-Hybrid
English  35.58%  38.90%    44.49%     54.41%     43.29%
                           79.32%     56.20%     45.24%
DNA      30.44%  27.96%    41.07%     43.17%     42.23%
                           75.24%     44.90%     44.22%
book1    40.76%  43.19%    53.21%     59.92%     53.30%
book2    33.83%  41.05%    45.60%     58.55%     46.53%
paper1   34.94%  47.17%    54.70%     66.17%     52.67%
paper2   36.19%  43.99%    54.65%     62.02%     52.10%
paper3   38.89%  47.63%    60.19%     67.92%     58.75%
paper4   41.66%  52.36%    69.20%     75.71%     68.24%
paper5   41.78%  55.04%    72.27%     79.47%     68.16%
paper6   34.72%  49.06%    56.84%     69.33%     54.76%
progc    33.51%  48.32%    54.97%     67.99%     51.95%
progl    22.71%  37.89%    37.82%     55.30%     35.47%
progp    22.77%  38.90%    35.97%     57.20%     34.20%



Table 2. Compression ratios for classical compressors and our byte versions. The 
second (italics) lines of English and DNA correspond to Block-LZ77, Mark-LZ78 
and Mark-Hybrid, respectively. 
In general, searching a compressed text has the additional advantage over
searching the uncompressed text that it performs less I/O. However, this is relevant only if
we compare compressed versus uncompressed searching. This is not what we
compare here: we consider that the text is always compressed. Hence, we measure
the cost of searching it without decompressing versus the cost of decompressing
it and then searching. Clearly the latter task can be done using an intermediate
buffer in main memory, and therefore the I/O is the same in both cases.

Figure 7 compares the marked and unmarked versions of LZ78 and the Hybrid
compressor. As can be seen, marking brings no advantage in practice.
Therefore, we do not further consider the marked versions. Another
conclusion we take from the figure is that the searcher for Hybrid compression 
is slightly faster than for LZ78 on English but slower for DNA. This may be 
related to the good performance of the LZ78 compressor on DNA. 





[Plots not reproduced. Surviving legend entries: Hybrid, Mark-Hybrid. Horizontal axis ticks run from 5 to 30.]

Fig. 7. Comparison between the marked and unmarked versions of LZ78 and 
Hybrid compressors. The left plot is for English text and the right one for DNA. 



Figure 8 compares all the search algorithms together, as well as decompression
(with gunzip) plus search time (with Shift-Or and BNDM, a bit-parallel
searcher which is among the fastest in practice). It can be seen that
Block-LZ77 improves significantly over LZ77, and that the Skip-LZ77 versions
improve as the pattern length grows. However, none of the LZ77 search algorithms
is competitive against decompressing and searching, especially on DNA.
On the other hand, both the Hybrid and LZ78 search algorithms are twice as
fast as decompressing and searching.

Table 3 compares the time to search a random 10-letter pattern on English,
DNA and the selected files of the Calgary Corpus. We consider the time to 
decompress with gunzip and to search with Shift-Or (as seen, for m = 10 the 
time is very close to BNDM). We show the results for LZ78 and Hybrid only, as 
LZ77 has been shown to be much inferior. 
[Plots not reproduced. Legend entries: LZ77, Block-LZ77, Skip-LZ77, Skip-Block-LZ77, LZ78, Hybrid, gunzip + Shift-Or, gunzip + BNDM. Horizontal axis ticks run from 5 to 30.]




Fig. 8. Comparison of the search algorithms. The dotted line is the time taken 
by gunzip alone. The left plot is for English text and the right one for DNA. 



8 Conclusions 

We have focused on the problem of string matching on Ziv-Lempel compressed
text. This is an important practical problem, as it is of interest to keep the texts
compressed while still being able to search them efficiently.

We presented a general paradigm to search in a text that is expressed as a 
sequence of blocks, which abstracts the main features of Ziv-Lempel compression. 
Then, we applied the technique to the different variants, i.e. LZ77 and LZ78. For 
LZ78, we are able to search in half the time of uncompressing and searching, while 
for LZ77 our algorithm, although much slower, is the first one proposed to search 
on LZ77 compressed text. This motivated us to present a new hybrid compression 
technique which allows searching as fast as in LZ78 but which keeps many of the
features of LZ77 compression, being in practice similar in compression ratios.

Therefore, we are able to search in a compressed text faster than uncompressing
and then searching. In general, on the other hand, searching compressed
text at the same speed as uncompressed text seems difficult to achieve in
practice because of a basic problem of locality of reference.

Future work involves a better study of the performance of our hybrid compression,
both in theory and in practice (especially on finding better methods to
encode the numbers while keeping the good search times). We also plan to work
File     gunzip  Shift-Or  LZ78            Hybrid
English  28.80   8.90      17.24 (45.7%)   16.65 (44.2%)
DNA      28.10   9.21      15.10 (40.5%)   17.27 (46.3%)
book1    18.40   4.92      10.91 (46.8%)   11.42 (49.0%)
book2    12.40   4.14       8.01 (48.4%)    7.78 (47.0%)
paper1    1.80   1.67       1.88 (54.2%)    1.92 (55.3%)
paper2    2.40   1.76       2.07 (49.8%)    2.18 (52.4%)
paper3    1.80   1.60       1.73 (50.9%)    1.88 (55.3%)
paper4    1.20   1.48       1.50 (56.0%)    1.59 (59.3%)
paper5    0.80   1.42       1.52 (68.5%)    1.54 (69.4%)
paper6    1.90   1.53       1.69 (49.3%)    1.78 (51.9%)
progc     1.50   1.55       1.73 (56.7%)    1.75 (57.4%)
progl     1.90   1.72       1.88 (51.9%)    1.84 (50.8%)
progp     1.20   1.62       1.74 (61.7%)    1.70 (60.3%)



Table 3. Search times for different files, in hundredths of a second. The percentages
indicate the time of the compressed searching as a fraction of uncompressing plus
Shift-Or searching.



more in understanding the behavior of the LZ77 search algorithm. Finally, we 
plan to allow for more flexible search, including features such as allowing classes 
of characters and Hamming errors (some work has been already done in I2til ) . 

This is a field where important theoretical and practical development is nec- 
essary, and we have presented new results in both aspects. We hope that more 
improvements are to come. 
Abstract. In this paper we focus on the problem of compressed pattern
matching for the text compression using antidictionaries, which is a new
compression scheme proposed recently by Crochemore et al. (1998). We
show an algorithm which preprocesses a pattern of length m and an
antidictionary M in $O(m^2 + \|M\|)$ time, and then scans a compressed
text of length n in $O(n + r)$ time to find all pattern occurrences, where
$\|M\|$ is the total length of strings in M and r is the number of pattern
occurrences.



1 Introduction 

Compressed pattern matching is one of the most interesting topics in
combinatorial pattern matching, and many studies have been undertaken on this
problem for several compression methods from both theoretical and practical
viewpoints. See Table 1. One important goal of compressed pattern matching is
to achieve a linear time complexity that is proportional not to the original text
length but to the compressed text length.

Recently, Crochemore et al. proposed a new compression scheme: text compression
using antidictionaries [8]. Contrary to the compression methods that
make use of dictionaries, which are particular sets of strings occurring in texts,
the new scheme exploits an antidictionary, that is, a finite set of strings that do
not occur as factors in the text, i.e. that are forbidden. Let $a_1 \ldots a_n \in \{0,1\}^*$ be
the text to be compressed. Suppose we have read a prefix $a_1 \ldots a_j$ at a certain
moment. If the string $a_i \ldots a_j b$ ($i \le j$, $b \in \{0,1\}$) is a forbidden word, namely,
is in the antidictionary, then the next symbol $a_{j+1}$ cannot be b. In other words,
the next symbol $a_{j+1}$ is predictable. Based on this idea, the compression method
removes such predictable symbols from the text. The compression and the
decompression are performed by using the automaton accepting the set of strings
in which no forbidden words occur as factors.

In this paper we focus on the problem of compressed pattern matching for
the text compression using antidictionaries. We present an algorithm that solves
the problem in $O(m^2 + \|M\| + n + r)$ time using $O(m^2 + \|M\|)$ space, where m
and n are the pattern length and the compressed text length, respectively, $\|M\|$



M. Crochemore, M. Paterson (Eds.): CPM’99, LNCS 1645, pp. 37-52, 1999.
© Springer-Verlag Berlin Heidelberg 1999
Table 1. Compressed pattern matching.

compression method       compressed pattern matching algorithms
run-length               Eilam-Tzoreff and Vishkin [11]
run-length (two dim.)    Amir, Landau, and Vishkin; Amir and Benson; Amir, Benson, and Farach
LZ77                     Farach and Thorup; Gąsieniec, Karpinski, Plandowski, and Rytter
LZW                      Amir, Benson, and Farach [3]; Kida, Takeda, Shinohara, Miyazaki, and Arikawa;
                         Kida, Takeda, Shinohara, and Arikawa
straight-line program    Karpinski, Rytter, and Shinohara; Miyazaki, Shinohara, and Takeda
Huffman                  Fukamachi, Shinohara, and Takeda; Miyazaki, Fukamachi, Takeda, and Shinohara
finite state encoding    Takeda
word based encoding      Moura, Navarro, Ziviani, and Baeza-Yates
pattern substitution     Manber; Shibata, Kida, Fukamachi, Takeda, A. Shinohara, T. Shinohara, and Arikawa [23]



denotes the total length of strings in antidictionary M, and r is the number of
pattern occurrences. Since M is a part of the compressed representation of text,
the text scanning time is $O(\|M\| + n + r)$, which is linear in the compressed text
length $\|M\| + n$, when ignoring r. Moreover, in the case where a set of text files
share a common antidictionary [8], we can regard the $O(\|M\|)$ time processing
of M as a preprocessing. Then the $O(n + r)$ time text scanning will be fast in
practice. The proposed algorithm thus has desirable properties.

2 Preliminaries 

Strings x, y, and z are said to be a prefix, factor, and suffix of the string u = xyz,
respectively. The sets of prefixes, factors, and suffixes of a string u are denoted
by Prefix(u), Factor(u), and Suffix(u), respectively. A prefix, factor, or suffix
of a string u is said to be proper if it is not u. The length of a string u is denoted
by |u|. The empty string is denoted by ε, that is, |ε| = 0. The ith symbol of a
string u is denoted by u[i] for $1 \le i \le |u|$, and the factor of a string u that begins
at position i and ends at position j is denoted by u[i : j] for $1 \le i \le j \le |u|$.
The reversed string of a string u is denoted by $u^R$. The total length of strings
of a set S is denoted by $\|S\|$. For strings x and y, denote by Occ(x, y) the set of
occurrences of x in y. That is,

$Occ(x, y) = \{\, i : |x| \le i \le |y|,\ x = y[i - |x| + 1 : i] \,\}$.

The next lemma follows from the periodicity lemma. 

Lemma 1. If Occ(x, y) has more than two elements and the difference of the
maximum and the minimum elements is at most |x|, then it forms an arithmetic
progression, in which the step is the smallest period of x.
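The definition of Occ and the statement of Lemma 1 can be checked on small inputs with a brute-force sketch. The helper names are hypothetical, positions are 1-based end positions as in the definition, and nothing here is meant to be efficient.

```python
def occ(x, y):
    """End positions i (1-based) with |x| <= i <= |y| and x == y[i - |x| + 1 : i]."""
    m = len(x)
    return {i for i in range(m, len(y) + 1) if y[i - m:i] == x}


def smallest_period(x):
    """Smallest p >= 1 such that x[k] == x[k + p] wherever both are defined."""
    return next(p for p in range(1, len(x) + 1) if x[p:] == x[:len(x) - p])


positions = sorted(occ("aa", "aaaa"))
steps = {b - a for a, b in zip(positions, positions[1:])}
```

For x = aa and y = aaaa the three occurrences end at positions 2, 3 and 4, they span at most |x| = 2 positions, and consecutive occurrences differ by 1, the smallest period of aa, as the lemma predicts.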
Fig. 1. Automaton $\mathcal{A}(M)$ for M = {0000, 111, 011, 0101, 1100}. Circles and
squares denote the final and the nonfinal states, respectively. Shaded circles
denote the predict states.



3 Text Compression Using Antidictionary 

In this section we describe the text compression scheme recently proposed by 
Crochemore et al. [8]. 



3.1 Method 

Let B = {0, 1}. Suppose that $T \in B^*$ is the text to be compressed. A forbidden
word for T is a string $u \in B^+$ that is not a factor of T. A forbidden word is said
to be minimal if it has no proper factor that is forbidden. An antidictionary for
T is a set of minimal forbidden words for T.

Let M be an antidictionary for T. Then the text T is in the set $B^* \setminus B^* M B^*$.
The automaton accepting the set $B^* \setminus B^* M B^*$ can be built from M in $O(\|M\|)$
time in a similar way to the construction of the Aho-Corasick pattern matching
machine [1]. We denote the automaton by

$\mathcal{A}(M) = (Q, B, \delta, \varepsilon, M)$,

where Q = Prefix(M) is the set of states; B is the alphabet; $\delta$ is the state
transition function from $Q \times B$ to Q defined as

$\delta(u, a) = \begin{cases} u, & \text{if } u \in M; \\ \text{longest string in } Q \cap \mathit{Suffix}(ua), & \text{otherwise;} \end{cases}$

ε is the initial state; M is the set of final states. Figure 1 shows the automaton
$\mathcal{A}(M)$ for M = {0000, 111, 011, 0101, 1100}, which is an antidictionary for text
T = 11010001.
The encoder and the decoder in this compression scheme are obtained directly
from the automaton $\mathcal{A}(M)$. The encoder $\mathcal{E}(M)$ is a generalized sequential
machine based on $\mathcal{A}(M)$ with output function $\lambda : Q \times B \to B \cup \{\varepsilon\}$ defined by

$\lambda(u, a) = \begin{cases} a, & \text{if } Deg(u) = 2; \\ \varepsilon, & \text{otherwise,} \end{cases}$

where $Deg(u) = |\{ a \in B \mid \delta(u, a) \notin M \}|$. The decoder $\mathcal{D}(M)$ is a generalized
sequential machine obtained by swapping the input label and the output label
on each arc of the encoder $\mathcal{E}(M)$. Figure 2 illustrates the move of the encoder
$\mathcal{E}(M)$ based on $\mathcal{A}(M)$ of Fig. 1, which takes as input T = 11010001 and emits
110. It should be noted that any prefix of 1101000100 with length greater than
6 is compressed into the same string 110. For decompression we therefore need
the length of T together with the encoded string itself. Formally, the compressed
representation of T is a triple $(M, b_1 \ldots b_n, N)$, where M is an antidictionary,
$b_1 \ldots b_n$ is the output from the encoder, and N is the length of T.
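The automaton and the encoder can be simulated directly from their definitions. The sketch below is an illustration under stated assumptions, not the construction used in the paper: states are represented by the prefix strings themselves, the transition function is computed naively instead of in linear time, the decoder simply replays the automaton rather than swapping arc labels, and all function names are hypothetical.

```python
def build_delta(antidictionary, alphabet="01"):
    """Transition function of A(M); the states are the prefixes of the words in M."""
    states = {w[:k] for w in antidictionary for k in range(len(w) + 1)}
    forbidden = set(antidictionary)
    delta = {}
    for u in states:
        for a in alphabet:
            if u in forbidden:
                delta[u, a] = u  # forbidden words are sinks
            else:
                ua = u + a
                # Longest suffix of ua that is a state (the empty string always is).
                delta[u, a] = next(ua[k:] for k in range(len(ua) + 1) if ua[k:] in states)
    return delta


def allowed(u, delta, forbidden, alphabet="01"):
    """Symbols a with delta(u, a) not in M; Deg(u) is the length of this list."""
    return [a for a in alphabet if delta[u, a] not in forbidden]


def encode(text, antidictionary):
    """Keep a symbol only when it is read in a state of degree 2."""
    delta, forbidden = build_delta(antidictionary), set(antidictionary)
    state, out = "", []
    for a in text:
        if len(allowed(state, delta, forbidden)) == 2:
            out.append(a)
        state = delta[state, a]
    return "".join(out)


def decode(bits, length, antidictionary):
    """Rebuild a text of the given length, predicting symbols in states of degree 1."""
    delta, forbidden = build_delta(antidictionary), set(antidictionary)
    state, out, pos = "", [], 0
    while len(out) < length:
        choices = allowed(state, delta, forbidden)
        if len(choices) == 2:
            a = bits[pos]
            pos += 1
        else:
            a = choices[0]
        out.append(a)
        state = delta[state, a]
    return "".join(out)


M = ["0000", "111", "011", "0101", "1100"]
compressed = encode("11010001", M)
restored = decode(compressed, 8, M)
```

On the running example the sketch emits 110 for T = 11010001 and restores T from the pair (110, 8), which also shows why the length N has to be stored: several longer prefixes of 1101000100 compress to the same three bits.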

Let us denote by MF(T) the set of all minimal forbidden words for T. In
the case of a binary alphabet we have $|MF(T)| \le 2 \cdot |T| - 2$, as shown in [7]. To
shorten the representation size of the above triple, we need a way to build a
‘good’ antidictionary as a subset of MF(T). Crochemore et al. presented in [8] a
simple method in which the antidictionary is the set of forbidden words of length at
most k, where k is a parameter. It is reported in [8] that the compression ratio
in practice is comparable to that of pkzip.
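A brute-force way to list the minimal forbidden words of length at most k is sketched below. It is meant only to make the definition concrete: it enumerates all candidate words, which is exponential in k, whereas practical constructions work from a factor automaton; the function name is hypothetical. It relies on the fact that a word is a minimal forbidden word exactly when it is not a factor while the words obtained by dropping its first or its last symbol both are.

```python
from itertools import product


def minimal_forbidden_words(text, k, alphabet="01"):
    """All minimal forbidden words of text with length at most k (brute force)."""
    factors = {text[i:j] for i in range(len(text)) for j in range(i, len(text) + 1)}
    result = set()
    for length in range(1, k + 1):
        for w in map("".join, product(alphabet, repeat=length)):
            if w not in factors and w[1:] in factors and w[:-1] in factors:
                result.add(w)
    return result


mfw4 = minimal_forbidden_words("11010001", 4)
```

For T = 11010001 and k = 4 this yields seven words, among them the five words of the antidictionary M = {0000, 111, 011, 0101, 1100} used in the running example.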



input:     1   1   0    1    0   0   0   1
state:   0 → 9 → 10 → 11 → 5 → 6 → 2 → 3 → 5
output:    1   1   ε    ε    ε   ε   0   ε

Fig. 2. Move of encoder $\mathcal{E}(M)$ for T = 11010001.



3.2 Decoder without ε-Moves

Note that the decoder $\mathcal{D}(M)$ mentioned above has ε-moves. For a simple presentation
of our algorithm, we shall define a generalized sequential machine $\mathcal{G}(M)$
obtained by eliminating the ε-moves from the decoder $\mathcal{D}(M)$.

Let us partition the set Q into four disjoint subsets M, $Q_0$, $Q_1$, and $Q_2$ by

$Q_i = \{ u \in Q \setminus M \mid Deg(u) = i \}$ (i = 0, 1, 2).

A state p in $Q_1$ is called a predict state because of the uniqueness of the outgoing
arc when ignoring the arcs into states in M. Namely, there exists exactly one
symbol a such that $\delta(p, a) \notin M$. We denote such symbol a by NextSymbol(p),
and denote by NextState(p) the state $\delta(p, a)$.

Consider, for $p \in Q_1$, the sequence $p_1, p_2, \ldots$ of states in Q defined by $p_1 = p$
and $p_{i+1} = NextState(p_i)$ (i = 1, 2, ...). There are two cases: One is the case that
there exists an integer m > 0 such that $p_i \in Q_1$ for i = 1, 2, ..., m − 1, and
$p_m \in Q_0 \cup Q_2$. The other is the case of no such integer m, namely, the sequence
continues infinitely. Let us call the sequence the predict path of p, and denote by
Terminal(p) the last state $p_m$. In the infinite case, let Terminal(p) = ⊥, where
⊥ is a special state not in Q. (Therefore, $Terminal(p) \in Q_0 \cup Q_2 \cup \{\bot\}$.) The
finite/semi-infinite string spelled out by the predict path of $p \in Q_1$ is denoted
by Sequence(p). It is easy to see that:

Lemma 2. For any $p \in Q_1$, there exist $u, v \in B^*$ with $|uv| \le |Q_1|$ such that

$Sequence(p) = u\,v\,v \cdots$.

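Now we are ready to define a generalized sequential machine $\mathcal{G}(M)$, where
the set of states is $Q_0 \cup Q_2 \cup \{\bot\}$; the state transition function is
$\delta_{\mathcal{G}} : Q_2 \times B \to Q_0 \cup Q_2 \cup \{\bot\}$ defined by

$\delta_{\mathcal{G}}(u, a) = \begin{cases} Terminal(\delta(u, a)), & \text{if } \delta(u, a) \in Q_1; \\ \delta(u, a), & \text{otherwise;} \end{cases}$

the output function is $\lambda_{\mathcal{G}} : Q_2 \times B \to B^+ \cup B^\infty$ defined by

$\lambda_{\mathcal{G}}(u, a) = \begin{cases} a \cdot Sequence(\delta(u, a)), & \text{if } \delta(u, a) \in Q_1; \\ a, & \text{otherwise,} \end{cases}$

where $B^\infty$ denotes the set of semi-infinite strings over B. Figure 3 shows the
decoder $\mathcal{G}(M)$ obtained in this way from the automaton $\mathcal{A}(M)$ of Fig. 1.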

The decompression algorithm using $\mathcal{G}(M)$ is shown in Fig. 4. It should be
emphasized that, if the decoder $\mathcal{G}(M)$ enters a state $q$ and then reads a symbol $a$
such that $\lambda_{\mathcal{G}}(q, a)$ is semi-infinite, then $a$ is the last symbol of the output
from the encoder $\mathcal{E}(M)$. In this case the decoder $\mathcal{G}(M)$ halts after emitting a
prefix of $\lambda_{\mathcal{G}}(q, a)$ whose length is determined by the value of $N$.



4 Main Result 

Generally, most text compression methods can be regarded as mechanisms
to factorize a text into several blocks as $T = u_1 u_2 \ldots u_n$ and to store a
sequence of 'representations' of the blocks $u_i$. In the LZW compression, for example,



Fig. 3. Decoder $\mathcal{G}(M)$ for M = {0000, 111, 011, 0101, 1100}.

Input. A compressed representation $\langle M, b_1 \ldots b_n, N \rangle$ of a text T = T[1 : N].

Output. Text T.

begin
    $\ell$ := 0;
    q := ε;
    for i := 1 to n - 1 do begin
        u := $\lambda_{\mathcal{G}}(q, b_i)$;
        q := $\delta_{\mathcal{G}}(q, b_i)$;
        $\ell$ := $\ell$ + |u|;
        print u
    end;
    u := $\lambda_{\mathcal{G}}(q, b_n)$;
    print the prefix of u with length N - $\ell$
end.

Fig. 4. Decompression by $\mathcal{G}(M)$.
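
The procedure of Fig. 4 can be sketched in Python as follows. The table encoding is an assumption made for illustration: `delta_g` and `lambda_g` are hypothetical dictionaries keyed by (state, symbol), and a possibly semi-infinite output $u\,v\,v \cdots$ is stored as the pair `(u, v)`, with `v == ""` in the finite case.

```python
from itertools import cycle, islice

def decompress(delta_g, lambda_g, bits, n_text, start):
    """Decode the non-empty string `bits` into a text of length n_text.

    delta_g[(q, b)]  -> next state
    lambda_g[(q, b)] -> (u, v), meaning the output u followed by v repeated
                        forever when v is non-empty
    Both tables are hypothetical encodings of the machine, not the paper's.
    """
    out = []
    length = 0
    q = start
    for b in bits[:-1]:
        u, v = lambda_g[(q, b)]
        # Only the last input symbol may produce a semi-infinite output.
        assert v == ""
        out.append(u)
        length += len(u)
        q = delta_g[(q, b)]
    u, v = lambda_g[(q, bits[-1])]
    need = n_text - length
    tail = u[:need]
    if len(tail) < need and v:
        tail += "".join(islice(cycle(v), need - len(tail)))
    out.append(tail)
    return "".join(out)
```

With the three transitions that can be read off Fig. 6 (state 0 on 1 gives output 1, state 9 on 1 gives 10100, state 2 on 0 gives 0100), decoding 110 with N = 9 yields 110100010.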



the representation of a block $u_i$ is just an integer which indicates the node of
the dictionary trie representing the string $u_i$. In the case of compression using
antidictionaries, the representation of a block is slightly more complicated.

Consider how to simulate the move of the KMP automaton for a pattern $\mathcal{P}$
running on the uncompressed text $T$. Let
$\delta_{\mathrm{KMP}} : \{0, 1, \ldots, m\} \times B \to \{0, 1, \ldots, m\}$
be the state transition function of the KMP automaton for $\mathcal{P} = \mathcal{P}[1 : m]$. We
extend $\delta_{\mathrm{KMP}}$ to the domain $\{0, 1, \ldots, m\} \times B^*$ in the standard manner. We also
define the function $\lambda_{\mathrm{KMP}}$ on $\{0, 1, \ldots, m\} \times B^*$ by

$$\lambda_{\mathrm{KMP}}(j, u) = \{\, 1 \le i \le |u| \mid \mathcal{P} \text{ is a suffix of the string } \mathcal{P}[1 : j] \cdot u[1 : i] \,\}.$$

We want to devise a pattern matching algorithm which takes as input a sequence
of representations of blocks $u_1, u_2, \ldots, u_n$ of $T$ and reports all occurrences of $\mathcal{P}$
in $T$ in $O(n + r)$ time, where $r = |Occ(\mathcal{P}, T)|$. For this we need a mechanism for
obtaining in $O(1)$ time the value $\delta_{\mathrm{KMP}}(j, u)$ and a linear size representation of
the set $\lambda_{\mathrm{KMP}}(j, u)$. In the case of the LZW compression such a mechanism can be
realized in $O(m^2 + n)$ time using $O(m^2 + n)$ space, as shown in earlier work. A similar
idea can also be applied to text compression by antidictionaries,
except that the block $u_i$, which is the second argument of $\delta_{\mathrm{KMP}}$
and $\lambda_{\mathrm{KMP}}$, is represented in a different manner.

In our case a block $u_i$ is represented as a pair of the current state $q$ of $\mathcal{G}(M)$
and the first symbol $b_i$ of $u_i$. Therefore we have to keep the state transitions of
$\mathcal{G}(M)$. An overview of our algorithm is shown in Fig. 5. The algorithm makes
$\mathcal{G}(M)$ run on $b_1 \ldots b_n$ to obtain the inputs $u_1, u_2, \ldots, u_n$ to the KMP automaton
being simulated. Figure 6 illustrates the move of the algorithm searching the
compressed text 110 for the pattern $\mathcal{P}$ = 0001.

We have the following theorems, which will be proved in the next section.

Input. A compressed representation $\langle M, b_1 b_2 \ldots b_n, N \rangle$ of a text T = T[1 : N],
and a pattern $\mathcal{P} = \mathcal{P}[1 : m]$.

Output. All positions at which $\mathcal{P}$ occurs in T.

begin
    /* Preprocessing */
    Construct the KMP automata and the suffix tries for $\mathcal{P}$ and $\mathcal{P}^R$;
    Construct the automaton $\mathcal{A}(M)$ from M;
    Construct the predict path graph from $\mathcal{A}(M)$;
    Perform the processing required for $\delta_{\mathcal{G}}$, $\delta_{\mathrm{KMP}}$, and $\lambda_{\mathrm{KMP}}$ (see Section 5);

    /* Text scanning */
    $\ell$ := 0;
    q := ε;
    state := 0;
    for i := 1 to n - 1 do begin
        u := $\lambda_{\mathcal{G}}(q, b_i)$;
        q := $\delta_{\mathcal{G}}(q, b_i)$;
        for each p ∈ $\lambda_{\mathrm{KMP}}$(state, u) do
            Report a pattern occurrence that ends at position $\ell$ + p;
        state := $\delta_{\mathrm{KMP}}$(state, u);
        $\ell$ := $\ell$ + |u|
    end;
    u := $\lambda_{\mathcal{G}}(q, b_n)$;
    for each p ∈ $\lambda_{\mathrm{KMP}}$(state, u) such that $\ell$ + p ≤ N do
        Report a pattern occurrence that ends at position $\ell$ + p
end.

Fig. 5. Pattern matching algorithm.
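
The text-scanning loop of Fig. 5 can be checked against a plain reference implementation. The sketch below is not the constant-time mechanism of the theorems: it assumes the blocks are already available as explicit strings (with the last block truncated by the caller) and evaluates $\delta_{\mathrm{KMP}}$ and $\lambda_{\mathrm{KMP}}$ symbol by symbol. The function names and the `alphabet` parameter are illustrative assumptions.

```python
def kmp_automaton(pattern, alphabet):
    """Transition table delta[j][c] of the KMP automaton, states 0..m.

    Assumes a non-empty pattern over `alphabet`.
    """
    m = len(pattern)
    delta = [dict.fromkeys(alphabet, 0) for _ in range(m + 1)]
    delta[0][pattern[0]] = 1
    x = 0                          # state reached on pattern[1:j]
    for j in range(1, m + 1):
        for c in alphabet:
            delta[j][c] = delta[x][c]
        if j < m:
            delta[j][pattern[j]] = j + 1
            x = delta[x][pattern[j]]
    return delta

def scan_blocks(pattern, blocks, alphabet="01"):
    """Report end positions (1-based) of pattern in the concatenated blocks."""
    delta = kmp_automaton(pattern, alphabet)
    m = len(pattern)
    state, length, hits = 0, 0, []
    for u in blocks:
        for p, c in enumerate(u, start=1):
            state = delta[state][c]
            if state == m:         # p belongs to lambda_KMP(state_before_u, u)
                hits.append(length + p)
        length += len(u)
    return hits
```

On the blocks 1, 10100, 010 of Fig. 6 and the pattern 0001, the only reported end position is 8.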



Theorem 1. The function which takes as input $(q, a) \in Q_2 \times B$ and returns
in $O(1)$ time the value $\delta_{\mathcal{G}}(q, a)$ can be realized in $O(\|M\|)$ time using $O(\|M\|)$
space.

Theorem 2. The function which takes as input a triple
$(j, q, a) \in \{0, \ldots, m\} \times Q_2 \times B$ and returns in $O(1)$ time the value

$\delta_{\mathrm{KMP}}(j, u)$ $(u = \lambda_{\mathcal{G}}(q, a))$

can be realized in $O(\|M\| + m^2)$ time using $O(\|M\| + m^2)$ space.

Theorem 3. The function which takes as input a triple
$(j, q, a) \in \{0, \ldots, m\} \times Q_2 \times B$ and returns in $O(1)$ time a linear size
representation of the set

$\lambda_{\mathrm{KMP}}(j, u)$ $(u = \lambda_{\mathcal{G}}(q, a))$

can be realized in $O(\|M\| + m^2)$ time using $O(\|M\| + m^2)$ space.

Then we have the following result.

input : 1 1 0

state of $\mathcal{G}(M)$ : 0 → 9 → 2 → 2

u : 1 10100 0100

state of KMP automaton : 0 → 0 → 2 → 2

output : ∅ ∅ {8}

Fig. 6. Move of pattern matching algorithm when T = 110100010 and $\mathcal{P}$ = 0001.



Theorem 4. The problem of compressed pattern matching for the text compression
using antidictionaries can be solved in $O(\|M\| + n + m^2 + r)$ time using
$O(\|M\| + m^2)$ space.

5 Algorithm in Detail 

This section gives a detailed presentation of the algorithm to prove Theorems 1, 
2, and 3. 

5.1 Proof of Theorem 1 

For a realization of $\delta_{\mathcal{G}}$, we have to find, for each $q \in Q_0 \cup Q_2 \cup \{\bot\}$, the pairs
$(p, b) \in Q_2 \times B$ such that $\delta(p, b) = p' \in Q_1$ and Terminal($p'$) $= q$. First of all,
we describe the graph consisting of the predict paths, which plays an important
role in this proof.

Consider the subgraph of $\mathcal{A}(M)$ in which the arcs are limited to the outgoing
arcs from predict nodes. To this subgraph we add auxiliary nodes $v = (p, b)$ and
new arcs labelled $b$ from $v$ to $q \in Q_1$, for all $p \in Q_2$ and $b \in B$ with $\delta(p, b) = q$.
We call the resulting graph the predict path graph. Figure 7 shows the predict path
graph obtained from $\mathcal{A}(M)$ of Fig. 1.

The predict path graph illustrates, for $(p, b) \in Q_2 \times B$, the string $\lambda_{\mathcal{G}}(p, b)$
as a path which starts at the auxiliary node $(p, b)$, passes through nodes in $Q_1$,
and either finally encounters a node in $Q_0 \cup Q_2$, or flows into a loop consisting
only of nodes in $Q_1$. A connected component of the predict path graph falls into
one of two classes: (a) a tree which has as root a node in $Q_0 \cup Q_2$ and has as leaves
auxiliary nodes, and (b) a loop with trees, each of which has as root a node on
the loop and has as leaves auxiliary nodes. See Fig. 8.

Fig. 7. Predict path graph. Rectangles denote the auxiliary nodes.

Fig. 8. Connected components of predict path graph.

Now we are ready to prove Theorem 1. The construction of $\delta_{\mathcal{G}}$ is as follows. First,
we set $\delta_{\mathcal{G}}(p, b) = \delta(p, b)$ for every $(p, b) \in Q_2 \times B$ with $\delta(p, b) \in Q_0 \cup Q_2$. Next,
for every node $q \in Q_0 \cup Q_2$ of the predict path graph, we traverse the tree that
has $q$ as root. Note that the leaves of the tree are auxiliary nodes $(p, b)$ such
that Terminal($\delta(p, b)$) $= q$, and we can set $\delta_{\mathcal{G}}(p, b) = q$. Finally, for every node $q$
on loops of the predict path graph, we traverse the tree that has $q$ as root. The
leaves of the tree are auxiliary nodes $(p, b)$ such that Terminal($\delta(p, b)$) $= \bot$, and
hence we set $\delta_{\mathcal{G}}(p, b) = \bot$. The total time complexity is linear in the number of
nodes of the predict path graph, i.e. $O(\|M\|)$. The proof is now complete.

5.2 Proof of Theorem 2 

In the following discussions, we frequently need to compute some value as a
function of $u$, where $u$ ranges over the strings spelled out by the paths from
auxiliary nodes. Even when the value for each path can be computed in time
proportional to the path length, the total time complexity of doing so path by
path is not $O(\|M\|)$, since more than one path can share common arcs.

Suppose that the value for each path can be computed by making an au- 
tomaton run on the path in the reverse direction. Then, we can compute the 
values for such paths by traversing every tree in the depth-first-order using a 
stack. Since this method enables us to ‘share’ the computation for a common 
suffix of two strings, the total time complexity is linear in the number of arcs,
i.e. $O(\|M\|)$. This technique plays a key role in the following proofs.
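
A small sketch of this sharing technique, under assumed data structures: the tree is given as a hypothetical dictionary `children` mapping a node to its (label, child) pairs, and the automaton is given by a hypothetical transition function `step`. Because arcs of the predict path graph point towards the root, the label sequence from the root down to a leaf is the reverse of the string spelled from that leaf.

```python
def values_at_leaves(root, children, step, start):
    """Run an automaton from the root down to every leaf, sharing prefixes.

    children[node] -> list of (label, child) pairs (hypothetical encoding).
    step(state, label) -> next automaton state.
    Returns {leaf: automaton state after reading the root-to-leaf labels}.
    Every arc is processed exactly once, so the cost is linear in the
    number of arcs.
    """
    result = {}
    stack = [(root, start)]
    while stack:
        node, state = stack.pop()
        kids = children.get(node, [])
        if not kids:
            result[node] = state
        for label, child in kids:
            stack.append((child, step(state, label)))
    return result
```

In the proofs that follow, `step` would be the transition function of an automaton or a descent in a suffix trie for the reversed pattern; here it is left abstract.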

For an integer $j$ with $0 \le j \le m$ and for a factor $u$ of $\mathcal{P}$, let us denote by
$N_1(j, u)$ the largest integer $k$ with $0 \le k \le j$ such that $\mathcal{P}[j - k + 1 : j] \cdot u$ is a
prefix of $\mathcal{P}$. Let $N_1(j, u) = nil$ if no such integer exists. Then, we have:

$$\delta_{\mathrm{KMP}}(j, u) = \begin{cases} N_1(j, u) + |u|, & \text{if } u \text{ is a factor of } \mathcal{P} \text{ and } N_1(j, u) \ne nil; \\ \delta_{\mathrm{KMP}}(0, u), & \text{otherwise.} \end{cases}$$

We assume that the second argument $u$ of $N_1$ is given as a node of the suffix
trie for $\mathcal{P}$. Amir et al. showed the following fact.

Lemma 3 (Amir et al. 1996). The function which takes as input
$(j, u) \in \{0, \ldots, m\} \times Factor(\mathcal{P})$ and returns the value $N_1(j, u)$ in $O(1)$ time can be
realized in $O(m^2)$ time using $O(m^2)$ space.






Yusuke Shibata et al. 



We have also the next lemma. 

Lemma 4. The function which takes as input $(q, a) \in Q_2 \times B$ and returns
$u = \lambda_{\mathcal{G}}(q, a)$ as a node of the suffix trie for $\mathcal{P}$ when $u \in Factor(\mathcal{P})$ can be
realized in $O(\|M\| + m^2)$ time using $O(\|M\| + m^2)$ space.

Proof. We use the technique mentioned above. We can ignore the infinite strings.
That is, we can ignore the trees in which a root is on a loop. Consider the problem
of determining whether $u^R$ is a factor of $\mathcal{P}^R$. It can be solved in $O(\min\{|u|, m\})$
time using the suffix trie for $\mathcal{P}^R$. If $u^R$ is a factor of $\mathcal{P}^R$, the node $u$ of the suffix
trie for $\mathcal{P}$ can be determined directly from the node $u^R$ of the suffix trie for $\mathcal{P}^R$,
assuming a trivial one-to-one mapping between the two suffix tries, which can
be computed in $O(m^2)$ time.



Lemma 5. The function which takes as input $(q, a) \in Q_2 \times B$ such that
$u = \lambda_{\mathcal{G}}(q, a)$ is finite and returns in $O(1)$ time the value $\delta_{\mathrm{KMP}}(0, u)$ can be realized
in $O(\|M\| + m)$ time using $O(\|M\| + m)$ space.

Proof. We use the technique mentioned above again. We have to consider the
problem of finding the length of the longest suffix of $u$ that is also a prefix of $\mathcal{P}$. This
is equivalent to finding the length of the longest prefix of $u^R$ that is also a suffix of
$\mathcal{P}^R$. It is solved in $O(\min\{|u|, m\})$ time using the suffix tree for $\mathcal{P}^R$. We can
ignore the trees in which a root is on a loop.

Theorem 2 follows from the lemmas above.



5.3 Proof of Theorem 3 

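According to whether a pattern occurrence covers the boundary between the
strings $\mathcal{P}[1 : j]$ and $u$, we can partition the set $\lambda_{\mathrm{KMP}}(j, u)$ into two disjoint
subsets as follows.

$$\lambda_{\mathrm{KMP}}(j, u) = \lambda_{\mathrm{KMP}}(j, \tilde{u}) \cup \lambda(u),$$

where

$$\lambda(u) = \{\, |\mathcal{P}| \le i \le |u| \mid \mathcal{P} \text{ is a suffix of } u[1 : i] \,\},$$

and $\tilde{u}$ is the longest prefix of $u$ that is also a proper suffix of $\mathcal{P}$. Let

$$Y(j, \ell) = Occ(\mathcal{P}, \mathcal{P}[1 : j] \cdot \mathcal{P}[m - \ell + 1 : m]) \ominus j,$$

where $\ominus$ denotes the element-wise subtraction. It is easy to see that
$\lambda_{\mathrm{KMP}}(j, \tilde{u}) = Y(j, |\tilde{u}|)$. It follows from Lemma 1 that the set $Y(j, \ell)$ has the
following property: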



Lemma 6. If $Y(j, \ell)$ has more than two elements, it forms an arithmetic
progression, where the step is the smallest period of $\mathcal{P}$.
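
Lemma 6 can be observed on small patterns with a brute-force sketch. The code below makes one assumption beyond the text: only occurrences that end strictly after the boundary position $j$ are counted, so that the resulting offsets are positive. The function names are illustrative.

```python
def smallest_period(p):
    """Smallest d >= 1 such that p[i] == p[i - d] for all valid i."""
    return next(d for d in range(1, len(p) + 1)
                if all(p[i] == p[i - d] for i in range(d, len(p))))

def y_set(p, j, l):
    """Brute-force Y(j, l): end positions, measured from the boundary, of
    occurrences of p in p[:j] + (last l symbols of p).

    Assumption: only occurrences ending after the boundary are kept.
    """
    m = len(p)
    text = p[:j] + p[m - l:]
    return [e - j for e in range(m, len(text) + 1)
            if e > j and text[e - m:e] == p]
```

For the pattern ababab with j = 5 and ℓ = 5 this gives the offsets 1, 3, 5, an arithmetic progression whose step 2 is the smallest period of the pattern.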



Pattern Matching in Text Compressed by Using Antidictionaries 






Lemma 7. The function which takes as input $(j, \ell) \in \{0, \ldots, m\} \times \{0, \ldots, m\}$
and returns in $O(1)$ time an $O(1)$ space representation of the set $Y(j, \ell)$ can be
realized in $O(m^2)$ time using $O(m^2)$ space.

Proof. It follows from Lemma 6 that $Y(j, \ell)$ can be stored in $O(1)$ space as a pair
of the minimum and the maximum values in it. The table storing the minimum
values of $Y(j, \ell)$ for all $(j, \ell)$ can be computed in $O(m^2)$ time, as shown in earlier work.
(The table $N_2$ defined there satisfies $\min(Y(j, \ell)) = m - N_2(j, \ell)$.) By reversing the
pattern $\mathcal{P}$, the table of the maximum values is also computed in $O(m^2)$ time. The
smallest period of $\mathcal{P}$ is computed in $O(m)$ time.



Lemma 8. The function which takes as input $(q, a) \in Q_2 \times B$ and returns in
$O(1)$ time the value $|\tilde{u}|$ with $u = \lambda_{\mathcal{G}}(q, a)$ can be realized in $O(\|M\| + m)$ time
using $O(\|M\| + m)$ space.

Proof. We shall consider the problem of finding the length of the longest suffix of
$u^R$ that is also a proper prefix of $\mathcal{P}^R$. This can be solved by using the KMP
automaton for $\mathcal{P}^R$. But we have to consider the case where $u$ is semi-infinite.
In the finite string case, we make the automaton start at the root of the tree in its
initial state. In the infinite string case, however, we must change the value of the
initial state. Let $v$ be the string spelled out by the loop starting at the root of
the tree being considered. We must pay attention to the case where a pattern
suffix is also a prefix of the string $v^\ell$ with $\ell > 0$. To determine the correct value
of the initial state at the root node, we make the automaton go around the loop
exactly $\ell$ times and stop it at the root node that is the starting point, where
$\ell$ is the smallest integer with $\ell \cdot |v| > |\mathcal{P}|$. The state of the automaton at that
moment is the desired value.



Lemma 9. The function which takes as input $(q, a) \in Q_2 \times B$ and returns in
$O(1)$ time a linear size representation of the set $\lambda(u)$ with $u = \lambda_{\mathcal{G}}(q, a)$ can be
realized in $O(\|M\| + m)$ time using $O(\|M\| + m)$ space.

Proof. By using the KMP automaton for the reversed pattern, we mark the
predict nodes at which the pattern begins. Suppose that every predict node
has a pointer to the nearest proper ancestor that is marked. Such pointers are
realized using $O(\|M\|)$ time and space. This enables us to get the elements of
$\lambda(u)$ in $O(|\lambda(u)|)$ time.

Theorem 3 follows from the lemmas above.

6 Concluding Remarks 

In this paper we focused on the problem of compressed pattern matching for
the text compression using antidictionaries proposed recently by Crochemore et al.
We presented an algorithm whose running time is linear in the compressed text
length, when we exclude the pattern preprocessing. We
are now implementing the algorithm to evaluate its performance from practical
viewpoints. In earlier work we showed that the Shift-And approach is effective in
compressed pattern matching for the LZW compression. We think that the
Shift-And approach can be substituted for the KMP automaton approach presented
in this paper and will show good performance in practice when the pattern length
$m$ is not so large, say $m \le 32$.

For a long pattern we can also consider the following method. Let $k$ be
the length of the longest forbidden word in the antidictionary. By using the
synchronizing property we obtain:

Lemma 10. If $|\mathcal{P}| \ge k - 1$, then $\delta(u, \mathcal{P}) = \delta(\varepsilon, \mathcal{P})$ for any state $u$ in $Q$ such
that $\delta(u, \mathcal{P}) \notin M$.

Let $p = \delta(\varepsilon, \mathcal{P})$. Since $p \in M$ implies that $\mathcal{P}$ cannot occur in $T$, we can assume
$p \notin M$. If $p$ is in $Q_1$, then let $q$ = Terminal($p$). Otherwise, let $q = p$. We can
monitor whether $\mathcal{A}(M)$ is in state $p$ by using the function $\delta_{\mathcal{G}}$ to
check whether $\mathcal{G}(M)$ is in state $q$. If so, we then confirm the occurrence. Our
preliminary experiments suggest that this search method is efficient in practice.

Abstract. This paper examines some of the rich structure of the syntenic
distance model of evolutionary distance, introduced by Ferretti,
Nadeau, and Sankoff. The syntenic distance between two genomes is
the minimum number of fissions, fusions, and translocations required to
transform one into the other, ignoring gene order within chromosomes.
We prove that the previously unanalyzed algorithm given by Ferretti
et al. is a 2-approximation and no better, and that, further, it always
outperforms the algorithm presented by DasGupta, Jiang, Kannan, Li,
and Sweedyk. We also prove the same results for an improved version 
of the Ferretti et al algorithm. We then prove a number of properties 
which give insight into the structure of optimal move sequences. We give 
instances in which any move sequence working solely within connected 
components is nearly twice optimal, and a general lower bound based 
on the spread of genes from each chromosome. We then prove a mono- 
tonicity property for the syntenic distance, and bound the difficulty of 
the hardest instance of any given size. We briefly discuss the results of 
implementing these algorithms and testing them on real synteny data. 



1 Introduction 

Numerous models for measuring the evolutionary distance between two species 
have been proposed in the past. These models are often based upon high-level 
(non-point) mutations which rearrange the order of genes within a chromosome. 
The distance between two genomes (or chromosomes) is defined as the minimum 
number of moves of a certain type required to transform the first into the second. 
A move for the reversal distance is the replacement of a segment of a
chromosome by the same segment in reversed order. For the transposition distance,
a legal move consists of removing a segment of a chromosome and reinserting it
at some other location in the chromosome.

Ferretti, Nadeau, and Sankoff propose a somewhat different sort of
measure of genetic distance, known as syntenic distance. This model abstracts
away from the order of the genes within chromosomes, and handles each
chromosome as an unordered set of genes. The legal moves are fusions, in which
two chromosomes join into one, fissions, in which one chromosome splits into
two, and reciprocal translocations, in which two chromosomes exchange sets of
genes. In practice, the order of genes within chromosomes is often unknown,
and this model allows the computation of the distance between species
regardless. Additional justification follows from the observation that interchromosomal
evolutionary events are rare relative to intrachromosomal
events. (For some discussion of this and related models, see the cited literature.)

Ferretti et al propose using synteny as a measure of the distance between 
genomes, and present a heuristic to approximate this distance. Although they 
give some experimental data on its performance, no formal analysis of this ap- 
proximation algorithm is given. Identifying a performance guarantee for this 
algorithm has remained an open question since. 

DasGupta, Jiang, Kannan, Li, and Sweedyk show a number of results
on the syntenic distance problem. They prove that computing the syntenic
distance between genomes is NP-hard, and provide a simple polynomial-time
2-approximation. They also prove a number of other useful structural results.



Our results. As with many NP-complete problems, reasoning about the syntenic 
distance is difficult. We are able, however, to show some results on the struc- 
ture of the problem, and analyze previously unanalyzed heuristics, including the 
original algorithm of Ferretti et al. These results give interesting insight into 
the rich structure of optimal move sequences. The structural properties aid in 
reasoning about the syntenic distance, and may lead to improved approximation 
algorithms. 

Using results from the literature, we prove a general lower bound for the syntenic
distance between two genomes. Roughly, if for many chromosomes c in one genome,
genes from c appear in many of the chromosomes of the other genome, then the 
instance is hard to solve. This lower bound may be helpful in developing im- 
proved approximation algorithms, since it implies that for the class of instances 
in which this scattering occurs, previously proposed algorithms are less than a 
factor of 2 away from optimal. 

We prove a monotonicity property for syntenic distance, showing a natural
ordering on the difficulty of problem instances. We define the syntenic diameter of
order n, Dy(n) (in the spirit of the reversal and transposition diameters), as
the maximum number of moves required to solve an instance of size n.
Monotonicity identifies a worst instance of size n, and implies that Dy(n) is exactly
the number of moves required to solve this instance. We prove that this worst
instance requires between $2n - 3 - \log_{4/3}(2n - 3)$ and $2n - 3$ moves, using our
lower bound. We leave open the question of providing tighter bounds on this
distance, though we conjecture that the minimum number of moves required to
solve this instance is exactly $2n - 3$.

Instance-by-instance comparison of two heuristics is a valuable notion that is 
rarely explored. This type of analysis leads to very strong results in comparing 
the performance of two approximation algorithms, even those with the same 
approximation ratio. Using this technique, we analyze the previously unanalyzed
approximation algorithm given by Ferretti et al., settling the open question of
finding a performance guarantee for this algorithm. We prove that this algorithm
is never worse than the approximation algorithm presented by DasGupta et al.,
immediately giving a performance guarantee of 2. We also show that there
are instances in which the algorithm performs a factor of $2 - \varepsilon$ away from optimal.

We also consider the algorithm resulting from making the fixes necessary to
handle the instances in which the original algorithm performs a factor of $2 - \varepsilon$
away from optimal. We prove the same results about this modified algorithm: it is
also a 2-approximation that is never worse than the DasGupta et al. algorithm, and
there are instances in which it performs a factor of $2 - \varepsilon$ away from optimal.

Call the connected components of an instance the connected components
of the intersection graph of the chromosomes. We prove the surprising result
that there are instances in which the optimal move sequence must connect two
unconnected components, and any move sequence that fails to do so is in fact
a factor of $2 - \varepsilon$ away from optimal. This implies that any approximation algorithm
that works only within components (as all currently proposed algorithms do) is doomed
to be no better than a 2-approximation. This raises the new problem of connected
synteny, in which the move sequence is constrained to work only within connected
components. The above results indicate that the algorithms of DasGupta et al.
and Ferretti et al. (and the modified version of the latter) are only
2-approximations for connected synteny, as well.

We also discuss a preliminary implementation of the syntenic distance model 
and all of the algorithms discussed above. We discuss the results of running all 
three algorithms on eight sets of real synteny data from the Institut National de
la Recherche Agronomique (INRA) Comparative Homology Database.

2 Notational Preliminaries and Previous Heuristics 

The syntenic distance model is as follows: a chromosome is a subset of a set
of n genes, and a genome is a collection of k chromosomes. A genome can be
transformed by any of the following moves (for S, T, U, and V non-empty sets
of genes): (1) a fusion $(S, T) \to U$, where $U = S \cup T$; (2) a fission $S \to (T, U)$,
where $T \cup U = S$; or (3) a translocation $(S, T) \to (U, V)$, where $U \cup V = S \cup T$.
The syntenic distance between genomes $\mathcal{G}_1$ and $\mathcal{G}_2$ is then given by the minimum
number of moves required to transform $\mathcal{G}_1$ into $\mathcal{G}_2$.

The compact representation of an instance of synteny is described and
formalized in earlier work. This representation makes the goal of each instance uniform
and thus eases reasoning about move sequences. For an instance in which we
are attempting to transform genome $\mathcal{G}_1$ into genome $\mathcal{G}_2$, we relabel each gene a
contained in a chromosome of $\mathcal{G}_1$ by the numbers of the chromosomes of $\mathcal{G}_2$ in
which a appears. Formally, we replace each of the k sets S in $\mathcal{G}_1$ with
$\bigcup_{\ell \in S} \{\, i \mid \ell \in G_i \,\}$ (where $\mathcal{G}_2 = G_1, G_2, \ldots, G_n$) and attempt to transform
these sets into the collection {1}, {2}, ..., {n}. As an example of the compact
representation, consider the instance

$\mathcal{G}_1$ = {x, y}     (Chromosome 1)      $\mathcal{G}_2$ = {p, q, x}        (Chromosome 1)
      {p, q, r}  (Chromosome 2)            {a, b, r, y, z}  (Chromosome 2)
      {a, b, c}  (Chromosome 3)

The compact representation of $\mathcal{G}_1$ with respect to $\mathcal{G}_2$ is {1, 2}, {1, 2}, {2} and
the compact representation of $\mathcal{G}_2$ with respect to $\mathcal{G}_1$ is {1, 2}, {1, 2, 3}. For an
instance of synteny in this compact notation, we will write S(n, k) to refer to the
instance where there are n elements and k sets in the compact representation.
Let D(S(n, k)) be the minimum number of moves required to solve a synteny
instance S(n, k).
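
The relabelling can be sketched in a few lines of Python. The representation of a genome as a list of sets of gene names, and the function name `compact`, are assumptions made for this illustration.

```python
def compact(g1, g2):
    """Compact representation of genome g1 with respect to genome g2.

    Genomes are lists of sets of gene names; the chromosomes of g2 are
    numbered from 1. Each chromosome S of g1 becomes the set of indices i
    such that some gene of S lies in the i-th chromosome of g2.
    """
    return [{i for i, chrom in enumerate(g2, start=1) if chrom & s}
            for s in g1]

# The example instance from the text.
G1 = [{"x", "y"}, {"p", "q", "r"}, {"a", "b", "c"}]
G2 = [{"p", "q", "x"}, {"a", "b", "r", "y", "z"}]
```

Here `compact(G1, G2)` gives {1, 2}, {1, 2}, {2} and `compact(G2, G1)` gives {1, 2}, {1, 2, 3}, matching the representations stated above. Genes that occur in only one of the two genomes, such as c and z, simply contribute nothing.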

We will say that two sets $S_1$ and $S_2$ are connected if $S_1 \cap S_2 \ne \emptyset$, and that
both are in the same component. For a gene $\ell$, let count($\ell$) be the number of
chromosomes in which $\ell$ appears.

Ferretti et al. present the approximation algorithm reproduced in Fig. 1,
which we denote by $\mathcal{F}$. (Two genes are syntenic iff they appear in the same
chromosome.) Although they provide some empirical evidence on the algorithm's
performance, they do not give any formal analysis.



Select an uneliminated gene $\ell$ to work on, under the following priorities:

Priority (A). Any $\ell$ for which count($\ell$) = 1.

Priority (B). Any $\ell$ for which count($\ell$) = 2.

Priority (C). If all count($\ell$) > 2, pick $\ell$ which minimizes count($\ell$) and, if there are
several such, which minimizes count($\ell'$) for some $\ell'$ in the chromosome remaining
from the last operation involving $\ell$. If there are several such, choose $\ell$ so that
after it is operated on, $\sum_{\ell} \mathrm{count}(\ell)$ is minimized.

For the $\ell$ selected above, do one of the following operations:

Operation (1). If count($\ell$) = 1 and some of the genes syntenic with $\ell$ appear in
no other chromosomes, effect a fission to create a separate chromosome {$\ell$}.
Operation (2). If count($\ell$) = 1 and all genes $\ell'$ syntenic with $\ell$ appear in
count($\ell'$) ≥ $c_{\min}$ > 1 chromosomes, effect a translocation to obtain a separate
chromosome {$\ell$}. The second chromosome involved in the translocation is one
that contains some gene $\ell'$ syntenic with $\ell$ with count($\ell'$) = $c_{\min}$, and, if there
are several, with a maximal number of genes syntenic with $\ell$.

Operation (3). If count($\ell$) > 1, effect count($\ell$) - 2 fusions followed by one
translocation (a), again to produce a separate {$\ell$}.

(a) This translocation could actually be a fusion if no other genes are present in the
component.

Fig. 1. The approximation algorithm $\mathcal{F}$.

Let $\mathcal{H}$ denote the approximation algorithm defined by DasGupta et al.: for each
connected component containing $n_i$ elements and $k_i$ sets, perform $k_i - 1$ fusions
to produce one set with all elements, then $n_i - 1$ fissions to produce the
$n_i$ singletons. Thus in an instance with p components, $\mathcal{H}$ requires $n + k - 2p$
moves. DasGupta et al. show that this algorithm is a 2-approximation, a tight
bound (the algorithm performs a factor of 2 away from optimal on the instance
{1}, {1, 2}, ..., {1, 2, ..., n}). To derive the performance guarantee for $\mathcal{H}$,
DasGupta et al. prove the following component bound: if an instance of synteny
S(n, k) has p components, then $D(S(n, k)) \ge n - p$.
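
Both quantities in this paragraph depend only on n, k, and the number p of components, so they are easy to compute for an instance in compact form. The sketch below assumes an instance is a list of Python sets; the function names are illustrative and do not come from the cited papers.

```python
def components(sets):
    """Number of connected components of the intersection graph of `sets`."""
    parent = list(range(len(sets)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    owner = {}                              # gene -> index of a set holding it
    for idx, s in enumerate(sets):
        for gene in s:
            if gene in owner:
                parent[find(idx)] = find(owner[gene])
            else:
                owner[gene] = idx
    return len({find(i) for i in range(len(sets))})

def h_moves_and_bound(sets):
    """Return (n + k - 2p, n - p): the move count of the fuse-then-fission
    algorithm and the component lower bound."""
    n = len(set().union(*sets))
    k = len(sets)
    p = components(sets)
    return n + k - 2 * p, n - p
```

On the instance {1}, {1, 2}, {1, 2, 3} this returns (4, 2), the factor of 2 mentioned above for n = 3.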

3 An Analysis of $\mathcal{F}$

In this section, we prove a number of results about $\mathcal{F}$. We first show that $\mathcal{F}$ is
never worse than $\mathcal{H}$, and is therefore a 2-approximation. We then show that the
factor of 2 is tight by giving a class of instances in which $\mathcal{F}$ performs arbitrarily
close to a factor of 2 away from optimal. (In Sect. 4 we give a modification of $\mathcal{F}$
that handles this class of instances optimally, and analyze the modified algorithm.)

Theorem 1. For any instance S(n, k) of synteny, $|\mathcal{F}(S(n, k))| \le |\mathcal{H}(S(n, k))|$.

Proof. Suppose that $\mathcal{F}$ generates a move sequence $\sigma$ on S(n, k). Suppose that in
$\sigma$ there are $m_1$ fissions (from Operation (1)), $m_2$ translocations (from Operations
(2) and (3)), and $m_3$ fusions (from Operation (3)).

Every translocation generated by Operation (2) is of the form
$(S \cup \{\ell\}, T) \to (S \cup T, \{\ell\})$ where $\ell \notin S \cup T$ and, for some gene $\ell'$, $\ell' \in S \cap T$. Every
translocation generated by Operation (3) is of the form
$(S \cup \{\ell\}, T \cup \{\ell\}) \to (S \cup T, \{\ell\})$ where $\ell \notin S \cup T$. Note that in either case, at the
time that $\{\ell\}$ is produced, $\ell$ appears nowhere else in the genome (i.e., count($\ell$) = 1).

We create a new move sequence $\sigma'$ which differs from $\sigma$ in that each
translocation $(S_1 \cup S_2, T_1 \cup T_2) \to (S_1 \cup T_1, S_2 \cup T_2)$ is replaced by the two-move
sequence $(S_1 \cup S_2, T_1 \cup T_2) \to S_1 \cup S_2 \cup T_1 \cup T_2 \to (S_1 \cup T_1, S_2 \cup T_2)$.

By the form of the translocations and this translation, we have the following
facts:

- Each of the newly-created fusions is within a connected component (the
input sets are connected by $\ell'$ for Operation (2) and $\ell$ for Operation (3)).

- Each of the newly-created fissions produces a singleton $\{\ell\}$ for a gene $\ell$ that
appears nowhere else in the genome.

Now we examine the fusions and fissions in $\sigma$. Each original fusion (from
Operation (3)) is also within a component (the two input sets are connected
by $\ell$), and each fission (in Operation (1)) produces a singleton of a gene that
appears nowhere else in the genome. Thus, every fusion in $\sigma'$ fuses two sets in
the same component, and every fission in $\sigma'$ produces a singleton set with an
element that appears nowhere else in the genome.

Clearly we can rearrange $\sigma'$ to completely solve each component before
beginning the next, since there are no intercomponent dependencies. Further, inside
each component we can put all the fusions before all the fissions, since the
fissions merely remove the last instance of an element from a larger set. In other
words, after rearrangement, $\sigma'$ does exactly what $\mathcal{H}$ does: within each
component, it fuses all the sets into one massive set, and then fissions off individual
elements one at a time. Note that $|\sigma'| = m_1 + 2 \cdot m_2 + m_3 = m_2 + |\sigma|$, and
thus $|\sigma| = |\sigma'| - m_2 = |\mathcal{H}(S(n, k))| - m_2$. In other words, $\mathcal{F}$ performs $m_2$ moves
better than $\mathcal{H}$ on each input. □

Corollary 2. $\mathcal{F}$ is a 2-approximation.

Proof. Immediate from Theorem 1 and the fact that $\mathcal{H}$ is a 2-approximation. □

We now show the corresponding lower bound for $\mathcal{F}$.

Lemma 3. For any $\varepsilon > 0$, there exists an instance S(n, k) with
$|\mathcal{F}(S(n, k))| > (2 - \varepsilon) \cdot D(S(n, k))$.

Proof. Select any n such that $1/(n - 1) < \varepsilon$. We give a synteny instance S(n, n)
such that $D(S(n, n)) = n - 1$ and $|\mathcal{F}(S(n, n))| = 2n - 3$. Then the ratio between
the result of $\mathcal{F}$ and the optimal is $(2(n - 1) - 1)/(n - 1) = 2 - 1/(n - 1)$, i.e., only
$1/(n - 1)$ less than two times optimal.

The instance S(n, n) consists of {1, 2, ..., n} and n - 1 copies of {n}. Here
is an n - 1 move sequence solving the instance:

[n - 1 moves] For i = 1 to n - 1, translocate the ith singleton {n}
with the remaining elements of the large set to produce the singleton {i}.

Each move removes one of the n - 1 genes appearing only in the large set
while absorbing another of the singleton {n} sets, so that after n - 1 of these
moves all the copies of n have been joined.

Now, we examine what $\mathcal{F}$ does on this input. Genes 1, 2, ..., n - 1 are exactly
symmetric in this instance, so we assume without loss of generality that $\mathcal{F}$ selects
them in ascending order.

[n - 2 moves] For i = 1 to n - 2, count(i) = 1. The gene n - 1 is
syntenic with i and appears in no other chromosome,
so by Operation (1) we fission off the singleton {i}.
This leaves {n - 1, n} and n - 1 copies of {n}.

[1 move] count(n - 1) = 1, so select it. By Operation (2),
translocate {n - 1, n} and {n} to produce {n - 1}
and {n}. This leaves n - 1 copies of {n}.

[n - 2 moves] Fuse the n - 1 copies of {n} by Operation (3).

Thus $\mathcal{F}$ requires n - 2 fissions, 1 translocation, and n - 2 fusions, or 2n - 3
moves total. □
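
The arithmetic behind the choice of n can be checked with exact rationals. The following sketch only evaluates the counts stated in the proof, $2n - 3$ moves against an optimum of $n - 1$; it does not simulate the algorithm, and the function names are illustrative.

```python
from fractions import Fraction

def witness_size(eps):
    """Smallest n >= 2 with 1/(n - 1) < eps, for eps > 0 given as a Fraction."""
    n = 2
    while Fraction(1, n - 1) >= eps:
        n += 1
    return n

def ratio(n):
    """Ratio (2n - 3)/(n - 1) between the two move counts in the proof."""
    return Fraction(2 * n - 3, n - 1)
```

For instance, with ε = 1/10 the smallest admissible size is n = 12, and the ratio 21/11 indeed exceeds 2 - ε = 19/10.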

4 A Possible Improvement to fF 

Note that the non-optimality of F on the instance in Lemma 3 is only the
result of applications of Operation (1). When the applications of this operation
have been completed, the resulting genome is {n − 1, n} and n − 1 copies of {n}.
F takes n − 1 more moves after this point, which, by the component bound, is
optimal. The non-optimality of Operation (1) is thus sufficient to cause F to be a
factor of 2 away from optimal. The difficulty with F results from overzealous
applications of Operation (1) when Operation (2) could do some good. (Notice
from Theorem 1 that the more translocations F does, the better its
performance.) Call F' the algorithm resulting from making the following fixes to F to
deal with this problem:

— Apply Operation (1) only if all of the genes syntenic with ℓ appear in no
  other chromosomes.

— Apply Operation (2) if any gene syntenic with ℓ appears in another
  chromosome. The second chromosome involved in the translocation is selected as in
  F, but ignoring those genes ℓ' syntenic with ℓ such that count(ℓ') = 1.

Note that F' performs optimally on the bad instance for F in Lemma 3:
the genes are still selected in the same order, but each of the n − 2 fissions
becomes a translocation, and the instance is solved after n − 1 translocations
in total.

Theorem 4. For any instance S(n, k) of synteny, |F'(S(n, k))| ≤ |H(S(n, k))|.

Proof. Analogous to the proof of Theorem 1. □

Corollary 5. F' is a 2-approximation. □

The following lemma shows, however, that F' is also no better than a
2-approximation.



Lemma 6. For any ε > 0, there exists an instance S(n, k) with |F'(S(n, k))| >
(2 − ε) · D(S(n, k)).

Proof. We give an instance S(αι + 1, αι + ι + 1) with D(S(αι + 1, αι + ι + 1)) = αι + ι
and |F'(S(αι + 1, αι + ι + 1))| = 2αι − 1 for 3 < α < ι. Selecting α > 3 and ι > α
so that ε > (2ι + 1)/(αι + ι) then gives us an instance in which F' performs
(2 − ε) away from optimal.

Consider the instance S(αι + 1, αι + ι + 1) consisting of the following sets,
for 3 < α < ι:



— S = {1, 2, ..., αι + 1}

— X_{j,q} = {j}, for 1 ≤ j ≤ ι and 1 ≤ q ≤ α

— ι copies of Z = {ι + 1, ι + 2, ..., αι + 1}.



Here is a move sequence solving S(αι + 1, αι + ι + 1) in αι + ι moves:

[ι − 1 moves]    Fuse the ι copies of Z, leaving S, the X_{j,q}'s, and Z.

[1 move]         Translocate S and Z to produce {1, 2, ..., αι} and
                 {αι + 1}.

[(α − 1)ι moves] Translocate (α − 1) of the singletons for each of the
                 genes 1, 2, ..., ι with the set {1, 2, ..., αι} to produce
                 the singletons {ι + 1}, {ι + 2}, ..., {αι}. This leaves
                 {1, 2, ..., ι} and {1}, {2}, ..., {ι}.

[ι moves]        For j = 1 to ι − 1, translocate {j} with the large set to
                 remove j from it. This leaves two copies of {ι}. Fuse
                 these to solve the instance.

We now examine what F' does on this instance. Notice that

    count(ℓ) = α + 1   for ℓ ∈ {1, 2, ..., ι},
    count(ℓ) = ι + 1   for ℓ ∈ {ι + 1, ι + 2, ..., αι + 1}.

Since 1, 2, ..., ι are completely symmetric in this instance, without loss of
generality we assume that the algorithm picks them in ascending order. Similarly,
ι + 1, ι + 2, ..., αι + 1 are symmetric, and we assume without loss of generality
that they are selected in ascending order, as well.

[α moves]            α < ι, so we first select ℓ = 1. Applying Operation
                     (3), we fuse all singletons {1} and then translocate
                     with S to produce {1} and {2, 3, ..., αι + 1}.

[α moves]            Select ℓ = 2. Apply Operation (3) as above to produce
                     {2} and {3, 4, ..., αι + 1}.

...

[α moves]            Select ℓ = ι. Apply Operation (3) as above to produce
                     {ι} and {ι + 1, ι + 2, ..., αι + 1}. The only remaining
                     sets are ι + 1 copies of the set Z.

[ι moves]            Select ℓ = ι + 1. Apply Operation (3) to fuse ι copies of
                     Z, and then translocate the last two copies to produce
                     {ι + 1} and {ι + 2, ι + 3, ..., αι + 1}.

[(α − 1)ι − 1 moves] Fission the remaining set {ι + 2, ι + 3, ..., αι + 1} into
                     singletons, by Operation (1).

F' thus has to complete αι + ι + (α − 1)ι − 1 = 2αι − 1 moves to solve this
instance. □
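The bookkeeping in this proof is easy to check numerically. The sketch below is illustrative only: it builds the instance for given α and ι (called alpha and iota in the code), tabulates count(ℓ), and adds up the per-phase move counts quoted above. The function names are hypothetical, and it does not simulate F' itself.

```python
from collections import Counter


def lemma6_instance(alpha, iota):
    """S, the alpha copies of each singleton {j}, and iota copies of Z."""
    n = alpha * iota + 1
    sets = [set(range(1, n + 1))]                                    # S
    sets += [{j} for j in range(1, iota + 1) for _ in range(alpha)]  # X_{j,q}
    sets += [set(range(iota + 1, n + 1)) for _ in range(iota)]       # copies of Z
    return sets


def gene_counts(sets):
    """count(l): the number of sets in which gene l appears."""
    return Counter(gene for s in sets for gene in s)


def move_totals(alpha, iota):
    """Sum the phases of the optimal sequence and of the F' trace."""
    optimal = (iota - 1) + 1 + (alpha - 1) * iota + iota
    f_prime = alpha * iota + iota + ((alpha - 1) * iota - 1)
    return optimal, f_prime
```

With alpha = 4 and iota = 5 the instance has 21 genes and 26 sets, the two totals are 25 and 39, and the ratio 39/25 equals 2 − (2ι + 1)/(αι + ι).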

Note that F does poorly on these instances, as well: bad choices of the
genes by Priority (C) are sufficient to cause the non-optimality, and F selects
genes in the same way as F'.



5 Moves between Connected Components 

It seems intuitive that when attacking an instance of synteny consisting of two
distinct connected components, the optimal move sequence would never fuse
these components together. In fact, H, F, and F' all work within connected
components. However, the following theorem shows that this approach
is doomed to be no better than a 2-approximation.

Theorem 7. For any algorithm A that works only within connected
components, and for any ε > 0, there exists an instance S(n, k) where |A(S(n, k))| >
(2 − ε) · D(S(n, k)).

Proof. We construct an instance of synteny S(n, n) solvable in n − 1 moves, but
for which A will require 2n − 4 moves. Selecting n so that ε > 2/(n − 1) yields
an instance where A is 2 − ε away from optimal.

Consider the instance S(n, n) consisting of {1, 2, ..., n − 1} and n − 1 copies
of {n}. First we observe that there is a move sequence solving this instance in
n − 1 moves:

[1 move]      Translocate {1, 2, ..., n − 1} and {n} to produce {1}
              and {2, 3, ..., n − 1, n}.

[n − 2 moves] For i = 2 through n − 1, perform a translocation of
              the set {i, i + 1, ..., n} and {n} to produce {i} and
              {i + 1, i + 2, ..., n}.

For any algorithm A working only within components, however, the moves
that A can make are severely limited. Since {1, 2, ..., n − 1} is a component all
by itself, there is no choice but to complete n − 2 fissions. The n − 1 copies of {n}
also form an entire component by themselves. Thus the only possible moves are
to complete n − 2 fusions to create a unique singleton. Therefore, A completes
2n − 4 moves on this instance. □
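To make the component structure of this instance concrete, here is a small sketch that groups sets into connected components (two sets are linked when they share a gene) using union-find. The helper names are hypothetical. The per-component move counts are taken directly from the argument above rather than computed by a general algorithm.

```python
def components(sets):
    """Group set indices into connected components (sets sharing a gene)."""
    parent = list(range(len(sets)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    owner = {}  # gene -> index of the first set seen containing it
    for idx, s in enumerate(sets):
        for gene in s:
            if gene in owner:
                parent[find(idx)] = find(owner[gene])
            else:
                owner[gene] = idx
    groups = {}
    for idx in range(len(sets)):
        groups.setdefault(find(idx), []).append(idx)
    return sorted(groups.values())


def theorem7_instance(n):
    """{1, ..., n-1} together with n - 1 copies of {n}."""
    return [set(range(1, n))] + [{n} for _ in range(n - 1)]


n = 6
sets = theorem7_instance(n)
comps = components(sets)
# Within components: n - 2 fissions for the big set, n - 2 fusions for the {n}'s.
within = (len(sets[comps[0][0]]) - 1) + (len(comps[1]) - 1)
```

For n = 6 the instance splits into the component containing only the large set and the component of five copies of {6}, so within is 8 = 2n − 4, against n − 1 = 5 moves when components may be mixed.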

It is now natural to define the connected synteny problem, to find the
minimum number of moves required to transform one genome into another with
all moves constrained to work only within a single component. We will use
D̂(S(n, k)) to denote the minimum number of moves within components required
to solve a synteny instance S(n, k).

Obviously, D̂(S(n, k)) ≥ D(S(n, k)). Because H, F, and F' all generate
move sequences that work within components, these algorithms are also
2-approximations for the connected synteny problem. In each of the examples
in which these algorithms are 2 − ε away from optimal, the optimal move
sequence works only within components. (In fact, there is only one component in
each example.) Thus H, F, and F' are all 2-approximations for this problem,
and no better. Whether it is easier to approximate connected synteny, however,
remains an open question.

6 Non-redundancy and Monotonicity 

In this section, we show that, for any instance, there is an optimal move sequence 
containing no moves that produce two sets with non-empty intersection. We also 
prove a monotonicity property for syntenic distance. 

We first need to introduce an extension to our notation to handle the case of
empty sets as input. If S_1, ..., S_k is a collection of sets and, for some i, S_i = ∅, we
understand the synteny instance S(n, k) = S_1, ..., S_k to represent the synteny
instance T(n, k − 1) = S_1, ..., S_{i−1}, S_{i+1}, ..., S_k.

Lemma 8. If there is a move sequence σ = (σ_1, ..., σ_m) solving S_1, ..., S_i ∪
{a}, ..., S_k where a ∉ S_i (with S_i possibly empty), then there exists a move
sequence σ' solving S_1, ..., S_i, ..., S_k in at most m steps.

Proof (by induction on m). For the base case (m = 1), σ_1 must solve the
instance. We have two cases (a cannot appear in more than one additional set,
since otherwise no single move could solve the instance):

— The element a also appears in some set S_j, j ≠ i.

  σ_1 must take S_i ∪ {a} and S_j as input, and produce the singleton {a} as
  output. Otherwise, two copies of the gene a remain, or the copy of a is
  bundled up with some other element(s). This first restriction implies that σ_1
  cannot be a fission.

  If σ_1 is the fusion (S_j, S_i ∪ {a}) → {a}, it must be the case that S_j = {a}
  and S_i = ∅. Thus S_1, ..., S_k is already in the target form, and in the new
  instance we are done without making any move.

  If σ_1 is a translocation, a must occur in only one of the output sets, for
  otherwise it appears twice and the instance is not solved. Thus σ_1 = (S_i ∪
  {a}, S_j) → (S_i ∪ [S_j − {a}], {a}). We can replace this by σ_1' = (S_i, S_j) →
  (S_i ∪ [S_j − {a}], {a}) to solve the instance S_1, ..., S_k.

— a does not appear elsewhere in the genome.

  Then the last move need not involve the set containing a. If it does not, then it
  must be the case that S_i = ∅. (Otherwise after the last move of the sequence
  a is in a non-singleton and the instance has not been solved.) In this case,
  simply doing the last move will solve S_1, ..., S_k.

  If the last move does involve S_i ∪ {a}, it is not a fusion since any fusing
  would couple a with some other element. (a would have to be coupled with
  some element b ≠ a, since a does not appear elsewhere in the genome.)

  If σ_1 is a fission, then it must produce a singleton set {a} and some other
  set not containing a in order to solve the instance. Since a ∉ S_i, this means
  that σ_1 = S_i ∪ {a} → ({a}, S_i). If we replace S_i ∪ {a} by S_i in the instance,
  the instance is already in the target form and we can skip this move.

  If σ_1 is a translocation, it must be (S_j, S_i ∪ {a}) → (U, {a}) for some set U,
  or else (as with the fusion case) the instance would not be solved. If a ∈ U
  then the instance is not solved, since a appears twice. Therefore it must be
  the case that U = S_i ∪ S_j. To solve the new instance, we can simply do the
  fusion (S_i, S_j) → S_i ∪ S_j and we are done.

For the inductive case (m ≥ 2), first we handle the cases when σ_1 is any
move that does not involve the set S_i ∪ {a}. For ℓ and j distinct from i:

— σ_1 = (S_ℓ, S_j) → S_ℓ ∪ S_j. Then σ_{2...m} solves S_r (1 ≤ r ≤ k, r ≠ ℓ, r ≠ j,
  r ≠ i), S_i ∪ {a}, S_ℓ ∪ S_j. By the inductive hypothesis, we have a move sequence
  σ' solving S_r (1 ≤ r ≤ k, r ≠ ℓ, r ≠ j, r ≠ i), S_i, S_ℓ ∪ S_j in at most m − 1
  moves.

— σ_1 = S_ℓ → (U, V). Then σ_{2...m} solves S_r (1 ≤ r ≤ k, r ≠ ℓ, r ≠ i), S_i ∪
  {a}, U, V. By the inductive hypothesis, we have a move sequence σ' solving
  S_r (1 ≤ r ≤ k, r ≠ ℓ, r ≠ i), S_i, U, V in at most m − 1 moves.

— σ_1 = (S_ℓ, S_j) → (U, V). Then σ_{2...m} solves S_r (1 ≤ r ≤ k, r ≠ ℓ, r ≠ j,
  r ≠ i), S_i ∪ {a}, U, V. By the inductive hypothesis, we have a move sequence σ'
  solving S_r (1 ≤ r ≤ k, r ≠ ℓ, r ≠ j, r ≠ i), S_i, U, V in at most m − 1 moves.

In each case, doing σ_1 and σ' solves S_1, ..., S_k in at most m moves. We now
consider the cases in which σ_1 takes S_i ∪ {a} as input.

— Suppose σ_1 is a fission, and that S_i = S_{i1} ∪ S_{i2}.

  If σ_1 = S_i ∪ {a} → (S_{i1} ∪ {a}, S_{i2}), then σ_{2...m} solves the instance
  S_r (1 ≤ r ≤ k, r ≠ i), S_{i1} ∪ {a}, S_{i2}. By the inductive hypothesis, there is a
  σ' solving S_r (1 ≤ r ≤ k, r ≠ i), S_{i1}, S_{i2} in at most m − 1 steps. Then doing
  S_i → (S_{i1}, S_{i2}) followed by σ' solves S_1, ..., S_k in at most m steps.

  If σ_1 = S_i ∪ {a} → (S_{i1} ∪ {a}, S_{i2} ∪ {a}), then σ_{2...m} solves the instance
  S_r (1 ≤ r ≤ k, r ≠ i), S_{i1} ∪ {a}, S_{i2} ∪ {a} in m − 1 moves. By the inductive
  hypothesis applied to σ_{2...m} and S_{i1} ∪ {a}, there is a σ' solving
  S_r (1 ≤ r ≤ k, r ≠ i), S_{i1}, S_{i2} ∪ {a} in at most m − 1 steps. Applying the
  inductive hypothesis again, this time to σ' and S_{i2} ∪ {a}, we have that there
  is a σ'' solving S_r (1 ≤ r ≤ k, r ≠ i), S_{i1}, S_{i2} in at most m − 1 steps. Then
  doing S_i → (S_{i1}, S_{i2}) followed by σ'' solves S_1, ..., S_k in at most m steps.

— Suppose that σ_1 is the fusion (S_i ∪ {a}, S_ℓ) → S_i ∪ {a} ∪ S_ℓ. Then σ_{2...m}
  solves the instance S_r (1 ≤ r ≤ k, r ≠ ℓ, r ≠ i), S_i ∪ {a} ∪ S_ℓ in m − 1 steps.

  If a ∈ S_ℓ, then S_i ∪ {a} ∪ S_ℓ = S_i ∪ S_ℓ. Thus σ_{2...m} solves
  S_r (1 ≤ r ≤ k, r ≠ ℓ, r ≠ i), S_i ∪ S_ℓ, and doing (S_i, S_ℓ) → S_i ∪ S_ℓ and
  σ_{2...m} solves S_1, ..., S_k in m steps.

  If a ∉ S_ℓ, then by the inductive hypothesis, there is a σ' solving
  S_r (1 ≤ r ≤ k, r ≠ ℓ, r ≠ i), S_i ∪ S_ℓ in m − 1 steps. To solve S_1, ..., S_k, we
  do the fusion (S_i, S_ℓ) → S_i ∪ S_ℓ and run σ', which requires at most m steps.

— Suppose σ_1 is a translocation using the set S_i ∪ {a} and S_ℓ, where S_i =
  S_{i1} ∪ S_{i2} and S_ℓ = S_{ℓ1} ∪ S_{ℓ2}. Then σ_1 must look like one of the following:

  (1) (S_ℓ, S_i ∪ {a}) → (S_{ℓ1} ∪ S_{i1}, S_{ℓ2} ∪ S_{i2} ∪ {a})

  (2) (S_ℓ, S_i ∪ {a}) → (S_{ℓ1} ∪ S_{i1} ∪ {a}, S_{ℓ2} ∪ S_{i2} ∪ {a}).

  In either case we replace this move by the translocation σ_1' = (S_ℓ, S_i) →
  (S_{ℓ1} ∪ S_{i1}, S_{ℓ2} ∪ S_{i2}).

  In case (1), if a ∈ S_{ℓ2}, then σ_{2...m} solves S_r (1 ≤ r ≤ k, r ≠ ℓ, r ≠ i),
  S_{ℓ1} ∪ S_{i1}, S_{ℓ2} ∪ S_{i2} in m − 1 steps, since S_{ℓ2} ∪ S_{i2} ∪ {a} = S_{ℓ2} ∪ S_{i2}.
  Then we can do σ_1' and σ_{2...m} to solve S_1, ..., S_k in m steps.

  If a ∉ S_{ℓ2}, then σ_{2...m} solves S_r (1 ≤ r ≤ k, r ≠ ℓ, r ≠ i), S_{ℓ1} ∪ S_{i1},
  S_{ℓ2} ∪ S_{i2} ∪ {a} in m − 1 steps. By the inductive hypothesis, there is a move
  sequence σ' solving S_r (1 ≤ r ≤ k, r ≠ ℓ, r ≠ i), S_{ℓ1} ∪ S_{i1}, S_{ℓ2} ∪ S_{i2} in at
  most m − 1 steps. Gluing this together with σ_1' yields a sequence solving
  S_1, ..., S_k in at most m moves.

  In case (2), if a ∈ S_{ℓ1} then S_{ℓ1} ∪ S_{i1} ∪ {a} = S_{ℓ1} ∪ S_{i1}, and this move is
  actually (S_ℓ, S_i ∪ {a}) → (S_{ℓ1} ∪ S_{i1}, S_{ℓ2} ∪ S_{i2} ∪ {a}), which is exactly case
  (1). Otherwise, a ∉ S_{ℓ1}. If a ∈ S_{ℓ2}, for exactly the same reason as above
  (with the roles of S_{ℓ1} and S_{ℓ2} reversed), we are again in case (1). Thus the
  only interesting case is when a ∉ S_{ℓ1} and a ∉ S_{ℓ2}.

  In this case, σ_{2...m} solves S_r (1 ≤ r ≤ k, r ≠ ℓ, r ≠ i), S_{ℓ1} ∪ S_{i1} ∪ {a},
  S_{ℓ2} ∪ S_{i2} ∪ {a} in m − 1 moves. By the inductive hypothesis applied to σ_{2...m}
  and S_{ℓ1} ∪ S_{i1} ∪ {a}, we have a move sequence σ' solving S_r (1 ≤ r ≤ k, r ≠ ℓ,
  r ≠ i), S_{ℓ1} ∪ S_{i1}, S_{ℓ2} ∪ S_{i2} ∪ {a} in at most m − 1 moves. Applying the
  inductive hypothesis again, to σ' and S_{ℓ2} ∪ S_{i2} ∪ {a}, we have a move sequence
  σ'' solving S_r (1 ≤ r ≤ k, r ≠ ℓ, r ≠ i), S_{ℓ1} ∪ S_{i1}, S_{ℓ2} ∪ S_{i2} in at most
  m − 1 moves. This is exactly the result of doing the translocation σ_1', so doing
  σ_1' and σ'' solves S_1, ..., S_k in at most m moves. □

Define a redundant move as any move creating two sets S and T such that
S ∩ T ≠ ∅. (Note that only fissions and translocations can be redundant, because
fusions do not create two sets.)

We need the following result on reordering from [2] to prove a theorem on
redundancy: for S(n, k) an instance of synteny and σ = (σ_1, ..., σ_m) any move
sequence solving the instance with m_1 fusions, m_2 translocations, and m_3
fissions, there exists a move sequence σ' solving the instance in m' ≤ m moves in
which every fusion precedes every translocation precedes every fission, using
m_1' ≤ m_1 fusions, m_2' ≤ m_2 translocations, and m_3' ≤ m_3 fissions. (DasGupta
et al. actually state this lemma for the case where σ is optimal, but the proof
extends to a general σ straightforwardly.) We refer to a move sequence in which
the fusions precede the translocations precede the fissions as in canonical order.

Theorem 9. For any synteny instance S(n, k), there exists an optimal move
sequence making no redundant moves.

Proof. Let σ = (σ_1, ..., σ_m) be a canonically-ordered optimal move sequence
solving S(n, k). There are no redundant fusions at all (by the definition of a
redundant move). Any redundant fission must yield two copies of at least one
gene a, say S_1 ∪ S_2 ∪ {a} → (S_1 ∪ {a}, S_2 ∪ {a}). But then there are two copies
of the gene a, and since all succeeding moves are also fissions, the number of a's
can only increase, and therefore the instance will not be solved.

Then the only possible redundant moves are translocations of the form (T_1 ∪
T_2 ∪ V, U_1 ∪ U_2 ∪ W) → (T_1 ∪ U_1 ∪ V ∪ W, T_2 ∪ U_2 ∪ V ∪ W) for some non-empty
overlap V ∪ W, with V ∩ (T_1 ∪ T_2) = ∅ and W ∩ (U_1 ∪ U_2) = ∅. Then by repeatedly
applying the transformation described in Lemma 8 to σ for every element of
V ∪ W, we can solve the instance resulting from replacing this redundant move
by the translocation (T_1 ∪ T_2 ∪ V, U_1 ∪ U_2 ∪ W) → (T_1 ∪ U_1 ∪ V ∪ W, T_2 ∪ U_2) in
at most as many moves. Repeating this sequentially for every redundant move
in σ yields a move sequence of length at most m with no redundant moves. □

Note that the canonicalizing process does not create redundancies: with a
non-redundant move sequence as input, it produces a non-redundant canonical
move sequence as output. Thus we can convert any move sequence σ into a
non-redundant canonical move sequence by consecutively applying canonicalization,
redundancy elimination, and canonicalization again.

Theorem 10 (Monotonicity). Let S_1, ..., S_k and T_1, ..., T_k be two
collections of sets where, for all 1 ≤ i ≤ k, T_i ⊆ S_i. Let n = |∪_i S_i| and n' = |∪_i T_i|.
Let S(n, k) = S_1, ..., S_k and let T(n', k) = T_1, ..., T_k. Then D(S(n, k)) ≥
D(T(n', k)).

Proof (by induction on δ = Σ_i |S_i − T_i|).

Base case (δ = 0). Then each S_i = T_i, and T(n', k) = S(n, k), so their
distances are trivially equal.

Inductive case (δ ≥ 1). Let σ = (σ_1, σ_2, ..., σ_m) be an optimal move sequence
solving S(n, k). Let j be the minimum index such that S_j ⊃ T_j and let a be any
element in S_j − T_j. By applying the transformation described in Lemma 8 we
can convert σ into a σ' solving S_1, S_2, ..., S_j − {a}, ..., S_k in at most m steps.
This instance is one element "closer" to T(n', k), so, by the inductive hypothesis,
we can solve T(n', k) in at most m steps. □
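On very small instances, monotonicity can be observed directly by exhaustive search. The sketch below is illustrative and not part of the paper: it computes D by breadth-first search over multisets of gene sets, treating any fusion, fission, or translocation whose outputs cover exactly the union of its inputs as legal. It relies on the canonical-order result quoted above to cap the number of sets at max(k, n), which is an assumption of this sketch. The function names are hypothetical, and the search is only practical for about three genes.

```python
from collections import deque
from itertools import combinations, product


def canon(sets):
    """Canonical hashable form of a multiset of gene sets."""
    return tuple(sorted(tuple(sorted(s)) for s in sets))


def covers(universe):
    """All ordered pairs (U, V) of non-empty sets with U | V == universe."""
    elems = sorted(universe)
    for assignment in product("UVB", repeat=len(elems)):  # B = in both outputs
        u = frozenset(e for e, a in zip(elems, assignment) if a in "UB")
        v = frozenset(e for e, a in zip(elems, assignment) if a in "VB")
        if u and v:
            yield u, v


def neighbors(state, cap):
    sets = [frozenset(s) for s in state]
    k = len(sets)
    if k < cap:  # fissions add a set, so respect the cap
        for i in range(k):
            rest = sets[:i] + sets[i + 1:]
            for u, v in covers(sets[i]):
                yield canon(rest + [u, v])
    for i, j in combinations(range(k), 2):
        rest = [s for t, s in enumerate(sets) if t not in (i, j)]
        merged = sets[i] | sets[j]
        yield canon(rest + [merged])       # fusion
        for u, v in covers(merged):        # translocation
            yield canon(rest + [u, v])


def syntenic_distance(instance):
    """Shortest move sequence to one singleton per gene (empty sets dropped)."""
    sets = [frozenset(s) for s in instance if s]
    genes = frozenset().union(*sets)
    start, target = canon(sets), canon([{g} for g in genes])
    cap = max(len(sets), len(genes))
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        state, dist = queue.popleft()
        if state == target:
            return dist
        for nxt in neighbors(state, cap):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    raise RuntimeError("target unreachable")
```

For example, three copies of {1, 2, 3} need 3 moves, while the smaller instance {1, 2, 3}, {3}, {3} (each set a subset of the corresponding larger one) needs 2, as Theorem 10 requires.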
7 A Lower Bound on Synteny



In this section, we give a lower bound on syntenic distance when many elements
appear in many sets. The intuition for the bound is the following: suppose that,
in the compact representation, many genes appear in many chromosomes. This
occurs exactly when, in the non-compact representation, for many chromosomes
c in genome G_1, genes from c appear in many of the chromosomes in genome G_2.
This can only occur if many evolutionary events "scattered" c from G_1 to G_2. If
this occurs for many chromosomes c, then many events must have occurred for
many chromosomes, and thus the distance between the genomes must be large.

To prove this lower bound, we need the following restricted form of the
synteny problem, defined in [?]. Define the linear synteny problem as the synteny
problem in which all move sequences are constrained as follows:

— The first k − 1 moves must be fusions or severely restricted translocations.
  One of the input sets is initially designated as the merging set. Each of the
  first k − 1 moves takes the current merging set Δ as input, along with one
  unused input set S, and produces a new merging set Δ'. If some element
  a appears nowhere in the genome except in Δ and S, then the move is the
  translocation (Δ, S) → (Δ', {a}), where Δ' = (Δ ∪ S) − {a}. If there is
  no such element a, then the move simply fuses the two sets: (Δ, S) → Δ',
  where Δ' = Δ ∪ S.

— If Δ is the merging set after the k − 1 fusions and translocations, then each
  of the next |Δ| − 1 moves simply fissions off a singleton {a} and produces
  the new merging set Δ' = Δ − {a}.

Let D̃(S(n, k)) be the length of the optimal linear move sequence. Note that
if a linear move sequence performs m_1 fusions in the first k − 1 moves, then the
move sequence contains k − m_1 − 1 translocations. After the k − 1 fusions and
translocations are complete, there are n − k + m_1 + 1 elements left in the merging
set, since exactly one element is eliminated by each translocation. Therefore,
n − k + m_1 fissions must be performed to eliminate the remaining elements.
Thus the length of the linear move sequence is n + m_1 − 1 moves. (Every move
either is a fusion or removes one element, and all but the last element must be
removed.)
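The counting argument can be illustrated by simulating a linear move sequence for one fixed order of the input sets. The sketch below is an assumption-laden illustration: it takes the first set as the initial merging set, always eliminates the smallest eligible element, and treats a step whose union is a single element as a fusion (a choice the text does not discuss). It reports the length for the given order only, whereas D̃ is the minimum over all orders. The function name is hypothetical.

```python
def linear_sequence_length(ordered_sets):
    """Return (length, fusions) of the linear move sequence for this order."""
    sets = [set(s) for s in ordered_sets]
    merging = sets[0]
    fusions = 0
    for idx in range(1, len(sets)):
        union = merging | sets[idx]
        unused = sets[idx + 1:]
        # Elements appearing nowhere except in the merging set and this set.
        private = [a for a in sorted(union) if not any(a in u for u in unused)]
        if private and len(union) > 1:
            merging = union - {private[0]}   # translocation splits off {a}
        else:
            merging = union                  # plain fusion
            fusions += 1
    # k - 1 merging moves, then |merging set| - 1 fissions.
    length = (len(sets) - 1) + (len(merging) - 1)
    return length, fusions
```

For five copies of {1, ..., 5} this gives 3 fusions and length 7 = n + m_1 − 1. For {1, 2, 3, 4} followed by three copies of {4} it gives no fusions and length 3.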

Theorem 11. For any instance of synteny S(n, k),

\[
\tilde{D}(S(n,k)) \;\ge\; n - 1 + \max_{1 \le c \le k-1}
  \Bigl( c - \bigl|\{\ell \mid \mathrm{count}(\ell) \le c + 1\}\bigr| \Bigr).
\]

Proof. Consider an arbitrary c between 1 and k − 1, and consider any linear move
sequence solving S(n, k). In the first c moves, only genes ℓ such that count(ℓ) ≤
c + 1 can be eliminated. (Any ℓ with count(ℓ) > c + 1 remains present in at least
one unused input set, since the first c moves can only merge c + 1 sets.)
Thus, in the first c moves we have at most |{ℓ | count(ℓ) ≤ c + 1}| translocations,
and therefore at least c − |{ℓ | count(ℓ) ≤ c + 1}| fusions. Since a linear move
sequence with m_1 fusions has length n + m_1 − 1, the instance requires at least
n − 1 + c − |{ℓ | count(ℓ) ≤ c + 1}| moves to solve. □
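The bound of Theorem 11 is straightforward to evaluate. The following sketch computes it for an instance given as a list of sets, assuming k ≥ 2 so that the range of c is non-empty. The function name is hypothetical.

```python
from collections import Counter


def linear_lower_bound(sets):
    """n - 1 + max over 1 <= c <= k-1 of (c - #{genes with count <= c + 1})."""
    sets = [set(s) for s in sets]
    k = len(sets)
    count = Counter(gene for s in sets for gene in s)
    n = len(count)
    best = max(c - sum(1 for v in count.values() if v <= c + 1)
               for c in range(1, k))
    return n - 1 + best
```

For five copies of {1, ..., 5} the maximum is attained at c = 3 and the bound is 7 = 2n − 3. For {1, 2, 3, 4} with three copies of {4} the bound is only 2, below the trivial n − 1, so the bound is informative mainly when many genes have high counts.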
DasGupta et al. prove that for any instance S(n, k) of synteny, D̃(S(n, k)) ≤
D(S(n, k)) + log_b D(S(n, k)), for some constant b. (In the full version of their
paper, DasGupta et al. show that we can take b = 4/3.) This gives us the following
bound on the general synteny problem:

Corollary 12. For any instance of synteny S(n, k),

\[
D(S(n,k)) + \log_{4/3} D(S(n,k)) \;\ge\; n - 1 + \max_{1 \le c \le k-1}
  \Bigl( c - \bigl|\{\ell \mid \mathrm{count}(\ell) \le c + 1\}\bigr| \Bigr).
\]
□



This bound may help in the development of improved approximation
algorithms for the (linear) synteny problem. In particular, for a significant class of
instances, H is better than a factor of 2 away from the linear optimal solution:

Corollary 13. For any instance S(n, k) of synteny in which n ≥ k, if there
exists some c such that c − |{ℓ | count(ℓ) ≤ c + 1}| ≥ βn + 1 for β ∈ (0, 1], then

\[
\frac{|\mathcal{H}(S(n,k))|}{\tilde{D}(S(n,k))} \;\le\; \frac{2}{\beta + 1}.
\]

Proof. Suppose that S(n, k) has p components. Then

\[
\frac{|\mathcal{H}(S(n,k))|}{\tilde{D}(S(n,k))}
  \;\le\; \frac{n + k - 2p}{n - 1 + \beta n + 1}
  \;\le\; \frac{2n - 2p}{(1 + \beta) n}
  \;\le\; \frac{2n}{(1 + \beta) n}
  \;=\; \frac{2}{1 + \beta}.
\]
□



8 Syntenic Diameter 

In this section, we consider the notion of the hardest instance of a given size,
and give bounds on how hard it is. We define the syntenic diameter of order n
as

    Dy(n) = max D(S(n, n)), taken over all instances S(n, n),

the number of moves required to solve the worst instance of up to n elements
and n sets. We also define the complete n-instance K_n(n, n) of synteny, which
consists of n copies of the set {1, ..., n}.

Lemma 14. Dy(n) = D(K_n(n, n)).

Proof. Immediate from monotonicity. □

Lemma 15. D̃(K_n(n, n)) = 2n − 3.

Proof. All genes appear n times, so

\[
n - 2 - \bigl|\{\ell \mid \mathrm{count}(\ell) \le n - 1\}\bigr| = n - 2.
\]

By Theorem 11 (with c = n − 2), then, D̃(K_n(n, n)) ≥ 2n − 3. We easily have that
D̃(K_n(n, n)) ≤ 2n − 3: complete n − 2 fusions to leave two copies of {1, ..., n},
complete 1 translocation to eliminate n and leave {1, ..., n − 1}, and then
n − 2 fissions to solve the instance. This is a linear move sequence of length
2n − 3 solving K_n(n, n). □



Theorem 16. 2n − 3 ≥ D(K_n(n, n)) ≥ 2n − 3 − log_{4/3}(2n − 3).

Proof. Clearly for any synteny instance D(S(n, k)) ≤ D̃(S(n, k)). Then, from
the bound on linear synteny proved by DasGupta et al., we have that

\[
\tilde{D}(\mathcal{K}_n(n,n)) \;\ge\; D(\mathcal{K}_n(n,n))
  \;\ge\; \tilde{D}(\mathcal{K}_n(n,n)) - \log_{4/3} D(\mathcal{K}_n(n,n))
  \;\ge\; \tilde{D}(\mathcal{K}_n(n,n)) - \log_{4/3} \tilde{D}(\mathcal{K}_n(n,n)).
\]

By Lemma 15 we have

\[
2n - 3 \;\ge\; D(\mathcal{K}_n(n,n)) \;\ge\; 2n - 3 - \log_{4/3}(2n - 3).
\]
□
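To get a feel for the size of the gap in Theorem 16, the following sketch evaluates both ends of the window for a given n. It is plain arithmetic on the stated bounds, with a hypothetical function name, and assumes n ≥ 2.

```python
import math


def diameter_window(n):
    """Return (lower, upper) bounds on D(K_n(n, n)) from Theorem 16."""
    upper = 2 * n - 3
    lower = upper - math.log(upper, 4 / 3)  # logarithm to base 4/3
    return lower, upper
```

For n = 10 the upper bound is 17 and the lower bound is roughly 7.15. The additive gap grows only logarithmically, so the relative gap shrinks as n increases.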

Note that this is almost tight, with only a log_{4/3}(2n − 3) window for the
syntenic diameter. Even more strongly, however, we conjecture that there is no
way to solve K_n(n, n) in any faster way than the linear sequence described in
the proof of Lemma 15.

Conjecture 17. D(K_n(n, n)) = 2n − 3.



9 A Preliminary Implementation 

We have implemented all of the heuristics discussed above (H, F, and F') in
Standard ML. The full implementation is approximately 750 lines of code. We
have run these algorithms on eight sets of real synteny data, found on the INRA
Comparative Homology Database. We make the following observations based
upon the results of these tests:

— In all cases, F' performed at least as well as F; on one of the eight data sets,
  F' outperformed F by one move.

— In most cases (5 of 8), the component bound was actually attained by both
  F and F'. In the 6th case, F' achieved the component bound and F was
  within one move of it. In each of these six cases, the instance can be solved
  using only translocations.

The latter point may raise some question about the validity of the model (that
it is too easy to solve too many instances, and thus that the model fails to be
informative about relative distances among groups of species), or may indicate
that there is simply insufficient synteny data presently available.

10 Conclusions and Future Work

We have proven a number of interesting structural results for syntenic distance, 
including monotonicity and the fact that improving the approximation ratio for 
this problem will require an algorithm that works among components. These 
results may help in solving the obvious remaining open question: 

— Is there an approximation algorithm for syntenic distance that achieves an 
approximation ratio strictly better than 2? 

The lower bound from Theorem 11 may be useful in improving the approximation
ratio. Other interesting open questions include:

— Can connected synteny be approximated any better than general synteny? 

— Can we improve the bound on D(K_n(n, n)) and prove Conjecture 17?

"A problem for the next century."
Paul Erdős



Abstract. We focus on the combinatorial analysis of physical mapping
with repeated probes. We present computational complexity results, and
we describe and analyze an algorithmic strategy. We are following the
research avenue proposed by Karp [2] on modeling the problem as a
combinatorial problem - the Hypergraph Superstring Problem - intimately
related to the Lander-Waterman stochastic model [?]. We show that a
sparse version of the problem is MAXSNP-complete, a result that carries
over to the general case. We show that the minimum Sperner
decomposition of a set collection, a problem that is related to the Hypergraph
Superstring Problem, is NP-complete. Finally we show that the Generalized
Hypergraph Superstring Problem is also MAXSNP-hard. We present
an efficient algorithm for retrieving the PQ-tree of optimal zero-repetition
solutions, which provides a constant approximation to the optimal
solution on sparse data. We provide experimental results on simulated
data.



1 Introduction and Previous Work 

Physical mapping using hybridization data involves the construction of genomic
maps based on the information contained in the clone-probe hybridization
matrix. The mapping technique has to cope with combinatorial difficulties that are
specific to the hybridization data. There are errors like chimerism, false
negatives or false positives, that come from the limitations in experimental accuracy.
Errors introduce specific combinatorial problems whose solutions could provide
good mapping hypotheses. Usually these optimization problems are NP-hard
and various heuristics - based on generalizations of the Consecutive Ones
Property (C1P) [?] - have been designed to cope with them, e.g., [?]. Another
important combinatorial dimension of the mapping problem arises from the
fact that most probes have multiple occurrences on the genomic region to be
mapped. The literature dealing with algorithms for mapping in the presence of
repeated probes is quite limited. In this paper we consider the combinatorial
difficulties of physical mapping with repeated probes, we identify some
computational bottlenecks, and we propose algorithms that exhibit various degrees of
measurable success.

The fundamental modeling paper of the area is the paper by Lander and
Waterman [?] in which the widely accepted Lander-Waterman model is introduced
and analyzed; see also [?] and [?] for further mathematical and
statistical analyses. According to the Lander-Waterman model, clones are distributed
uniformly along the genomic region, and probes are distributed according to a
Poisson distribution.

The only published algorithmic work focussing on mapping with repeated
probes seems to be [?], although further recent work devoted to the problem is
in progress [?, ?]. In [?] algorithmic strategies are proposed, based on the
Lander-Waterman model by attempting to approximate the likelihood function,
leading to NP-complete optimization problems that are reasonably tractable in
practice. The algorithmic strategy proposed there uses local search 3-opt
Lin-Kernighan type heuristics. No approximation algorithms with a provable
guarantee were obtained. Based on this work, Karp [?] proposed the problem of
designing approximation algorithms with guaranteed error bounds for the
shortest superstring of a set collection - in our present terminology, the Hypergraph
Superstring Problem. This optimization problem is a combinatorial problem
intimately related to the Lander-Waterman model, capturing the search for minimal
explanations of the hybridization data. This combinatorial problem was
introduced before (see [?]) and it is notoriously difficult [?, ?]. We are
interested here in the sparse version of the problem, consistent with biologically
relevant data of the Lander-Waterman model.

Kou proves in a paper devoted to information retrieval and file
organization [?] that a variant of the C1P - modeling multiple storage of records - is
NP-complete. In our terminology the result is that the Hypergraph Superstring
Problem for strict Sperner hypergraphs is NP-complete. In [?], non-tight upper
and lower bounds were obtained for the hypergraph superstring length for the
special case of the hypergraph being the power set of a finite set. [?] gives a
comprehensive overview of the problem.

A clone-probe hybridization matrix is a 0/1 matrix with rows representing
clones, columns representing probes, and a 1 in position (i, j) if and only if
probe j is incident to clone i. Any permutation of the columns of such a matrix
results in the same clone/probe incidence relationship. A collection of clones
has the Consecutive Ones Property (C1P) [?] if there is a permutation of the
columns of the hybridization matrix that allows each row (clone) to be of the form
0···01···10···0, i.e., in a consecutive ones form. The obvious biological relevance
of the C1P is that each clone spans a connected region of the genome. A
clone-probe hybridization matrix containing "perfect" data, i.e., containing no errors
and only unique probes, is a matrix that obeys the C1P. An important property
for a heuristic mapping algorithm is to retrieve the C1P in the absence of errors
[?]. This is one of the properties that our mapping algorithms achieve.
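The definition can be made concrete with a brute-force check. The sketch below tries every column permutation of a small 0/1 matrix and returns the first one that puts each row into consecutive ones form. It is exponential in the number of probes and serves only to illustrate the definition; it is not the efficient PQ-tree approach, and the function names are hypothetical.

```python
from itertools import permutations


def is_consecutive(row):
    """True if the 1s in this row form one contiguous block (or are absent)."""
    ones = [j for j, bit in enumerate(row) if bit]
    return not ones or ones[-1] - ones[0] + 1 == len(ones)


def c1p_order(matrix):
    """Return a column order giving every row consecutive ones, or None."""
    n_cols = len(matrix[0]) if matrix else 0
    for order in permutations(range(n_cols)):
        if all(is_consecutive([row[j] for j in order]) for row in matrix):
            return order
    return None
```

For the two clones [1, 0, 1] and [0, 1, 1], swapping the last two probes works, so the function returns (0, 2, 1). For three clones that each contain a different pair of three probes, no order exists and it returns None.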
A feature of the Lander-Waterman model is the Sperner property of a set
collection: no set is included in another. Indeed, as the number of probes
increases, the set of clones of the Lander-Waterman model has the Sperner
property with high probability. The PQ-tree algorithm [?] that retrieves the C1P
uses a framework that hierarchically decomposes the initial collection of sets into
subcollections that avoid sets included in unions of other sets.

The C1P property of a hybridization matrix ensures that there are no
repeated probes. The Sperner decomposition of a set collection satisfying the C1P,
and the optimal merging of sets in such a collection to obtain a PQ-tree are
relatively easy computing tasks. Both tasks become computationally intractable
for very sparse instances of data with repeated probes. To get insight into the
new combinatorial difficulties, consider the intersection graph IG of a set
collection. The vertices are the sets of the collection, and an edge exists between
two vertices when the corresponding sets intersect. In the C1P case, the strict
Sperner collections are sets of disjoint paths (SDP) in IG, while in the
Hypergraph Superstring Problem they are general graphs. These facts point to the
importance of strict Sperner collections as building blocks in the hierarchical
decomposition of the Hypergraph Superstring Problem. As we will see in this
paper, both the Sperner decomposition as well as the optimal merging of the
sets in a strict Sperner collection are MAXSNP-/NP-complete tasks.

In all the above discussion the implicit assumption has been that a probe
never appears more than once in a particular clone. This is a simplifying
assumption that is justifiable probabilistically by the Lander-Waterman model,
as the Poisson parameter $\lambda$ governing probe distribution decreases.
However, this property is not necessarily guaranteed in practice. In fact the
genome deviates from the Lander-Waterman model by means of certain sequence
patterns that are repeated and could cause higher than expected probe
repetition. An alternative model, therefore, is to seek the minimal explanation
of the hybridization data in the form of a multiset superstring that allows for
possible repetition of probes in a single clone. We prove that this problem is
also MAXSNP-complete.

We present and test the GREEDY-MERGE algorithm, which is based on Sperner
decomposition of hypergraphs, with the following provable performance: (1) it
retrieves the PQ-tree of all optimal zero-repetition superstrings; (2) on
strict Sperner hypergraphs it is provably a 1.5625-approximation algorithm; (3)
it provides a 2-approximation for hypergraphs with a restricted Sperner
decomposition. The algorithm has cubic worst-case time complexity, and is much
faster on sparse, biologically relevant data. We tested the algorithm on data
generated according to the Lander-Waterman model and found that it approximates
the length of the initial (correct) superstring within a factor of 1.1 in most
problems involving 100-200 clones, 200-400 probes, and 1.5 to 4.9 average probe
repetition.

2 Background

2.1 Physical Mapping 

DNA molecules are very long sequences over an alphabet of four letters, or
nucleotides: {A, G, C, T}. The study of a genomic region involves breaking it
into smaller pieces that can be sequenced by present technologies. Physical
mapping involves reassembling the true arrangement of the pieces on the initial
genomic region, and then sequencing the smallest subset of pieces that cover
the region. The cloning procedure incorporates the pieces of DNA into
biological hosts. Each such copy is a clone. Through self-replication, a large
number of copies of each clone are obtained. The result is a clone library
containing many copies of pieces of the initial genomic region. The
reconstruction process is based on data indicating “overlap” between clones.
One method of detecting overlaps is through the hybridization of short
sequences, called probes. Hybridization occurs when a probe sequence is
complementary to a subsequence of a clone. If the probe has a unique occurrence
on the initial genomic region and if two clones are hybridized by the same
probe, then they overlap. This assumes ideal experimental conditions, i.e., no
errors. So, unique probes detect overlap. However, in general probes are
complementary to multiple places on the genomic region, so detecting overlap is
ambiguous. The information contained in the hybridization data can be
summarized as follows. Let the clones be $\{C_1, \ldots, C_n\}$ and the probes
be $\{P_1, \ldots, P_m\}$. Let the matrix $H$ be defined by H[i,j] = 1 if probe
$P_j$ hybridizes to clone $C_i$, and H[i,j] = 0 otherwise. The problem studied
in this paper is that of using the hybridization data given in the matrix $H$
to reassemble the clones so as to reconstruct the initial genomic region. Note
that the process of breaking the DNA into pieces and selecting probes, even in
a perfect cloning and hybridization experimental scenario, might result in loss
of information. Therefore, we may not be able to obtain the exact
reconstruction. To make the problem well defined, we aim at obtaining the
maximal mapping information consistent with $H$.
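
A minimal sketch of the hybridization matrix just defined, assuming each clone is given as the set of probe names that hybridize to it. The helper names `hybridization_matrix` and `pairs_sharing_a_probe` are hypothetical, and the second helper yields genuine overlaps only under the idealized assumption of unique probes and no errors.

```python
def hybridization_matrix(clones, probes):
    """H[i][j] = 1 if probe probes[j] hybridizes to clone clones[i], else 0.

    clones: list of sets of probe names; probes: ordered list of probe names.
    """
    return [[1 if p in clone else 0 for p in probes] for clone in clones]


def pairs_sharing_a_probe(H):
    """Index pairs (a, b), a < b, of clones hybridized by a common probe.

    With unique probes and no errors such pairs overlap; with repeated
    probes the inference is ambiguous, as the text points out.
    """
    pairs = set()
    n_probes = len(H[0]) if H else 0
    for j in range(n_probes):
        hit = [i for i, row in enumerate(H) if row[j]]
        pairs.update((a, b) for k, a in enumerate(hit) for b in hit[k + 1:])
    return pairs


clones = [{"p1", "p2"}, {"p2", "p3"}, {"p4"}]
probes = ["p1", "p2", "p3", "p4"]
H = hybridization_matrix(clones, probes)
shared = pairs_sharing_a_probe(H)
```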



2.2 The Lander-Waterman Model

We will first define the Lander-Waterman model and then formulate a
combinatorial problem in terms of hypergraphs, an appropriate framework for
probe/clone hybridization data. Then superstrings are introduced in order to
search for the minimal total repetition of the probes needed to explain the
hybridization data.

The Lander-Waterman Model

1. A clone is an interval of length 1 contained in the interval [0, N]. The
   left endpoints of the clones are independent random variables, uniformly
   distributed over [0, N - 1].

2. Probes are distributed along the interval [0, N] according to independent
   Poisson processes of rate $\lambda$. That is, a probe occurs in a short
   interval of length dx with probability $\lambda\,dx$, and disjoint intervals
   are independent.
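
The two rules of the model can be turned into a small simulation. The following is a sketch under stated assumptions rather than the generator used for the experiments: each probe gets its own Poisson process of rate `rate` on [0, N], realized through exponential gaps, and a clone is reported as the set of probe indices with at least one occurrence inside it. The names `poisson_points` and `simulate` are hypothetical.

```python
import random


def poisson_points(rate, length, rng):
    """Points of a Poisson process of the given rate on [0, length)."""
    points = []
    x = rng.expovariate(rate)          # exponential gap to the first point
    while x < length:
        points.append(x)
        x += rng.expovariate(rate)     # independent exponential gaps
    return points


def simulate(N, n_clones, n_probes, rate, seed=0):
    """Return (left endpoints, probe occurrences, clones as sets of probe ids)."""
    rng = random.Random(seed)
    # Rule 1: unit-length clones with left endpoints uniform on [0, N - 1].
    lefts = [rng.uniform(0, N - 1) for _ in range(n_clones)]
    # Rule 2: one independent Poisson process per probe.
    occurrences = [poisson_points(rate, N, rng) for _ in range(n_probes)]
    clones = [
        {j for j, pts in enumerate(occurrences)
         if any(left <= x <= left + 1 for x in pts)}
        for left in lefts
    ]
    return lefts, occurrences, clones


lefts, occurrences, clones = simulate(N=10, n_clones=20, n_probes=30, rate=0.2, seed=1)
```

Representing a clone as a set discards repeated occurrences of one probe inside that clone, which matches the simplifying assumption discussed in the introduction.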

2.3 The Hypergraph Superstring Problem 

Hypergraphs. A hypergraph is a pair $H = (X, \mathcal{S})$, where $X$ is a
finite set and $\mathcal{S} = \{S_1, \ldots, S_m\}$ is a family of subsets of
$X$. The sets $S_i$ are called hyperedges. The following definitions apply to
hypergraphs as well as to families of sets. A hypergraph is $B$-bounded if all
of its hyperedges have at most $B$ elements. A hypergraph is a chain if
$\mathcal{S} = \{S_1, \ldots, S_m\}$ and
$S_1 \subset S_2 \subset \cdots \subset S_m$. A hypergraph is called an
antichain, or Sperner, if no $S_i$ is included in $S_j$, for every $i \neq j$,
$1 \le i, j \le m$. A hypergraph is called strict Sperner if no hyperedge is
included in the union of the other hyperedges, or equivalently if every
hyperedge has a characteristic element.

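A Sperner decomposition of a hypergraph $H = (X, \mathcal{S})$ is a
decomposition of $\mathcal{S}$ into subfamilies of sets called levels
$\mathcal{S}_1, \ldots, \mathcal{S}_t$ such that: (1) the levels partition
$\mathcal{S}$, i.e., $\mathcal{S} = \mathcal{S}_1 \cup \cdots \cup \mathcal{S}_t$
and $\mathcal{S}_i \cap \mathcal{S}_j = \emptyset$ for $1 \le i \neq j \le t$;
(2) $\mathcal{S}_i$ is a strict Sperner family of sets for every $i$,
$1 \le i \le t$; and (3)
$\bigcup \mathcal{S}_1 \supseteq \bigcup \mathcal{S}_2 \supseteq \cdots \supseteq \bigcup \mathcal{S}_t$.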
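
The definitions above translate directly into set predicates. This is a minimal sketch with hypothetical function names; the nesting test in `is_sperner_decomposition` assumes that the union of level 1 contains the union of level 2, and so on, which is the direction used later in the proof of Theorem 2.

```python
def is_sperner(sets):
    """No set of the family is included in another one."""
    return not any(
        i != j and a <= b
        for i, a in enumerate(sets)
        for j, b in enumerate(sets)
    )


def is_strict_sperner(sets):
    """No set is included in the union of the others
    (equivalently, every set has a characteristic element)."""
    for i, a in enumerate(sets):
        others = set().union(*(b for j, b in enumerate(sets) if j != i))
        if a <= others:
            return False
    return True


def is_sperner_decomposition(levels):
    """levels: list of lists of sets, already a partition of the family.

    Checks that every level is strict Sperner and that the unions of the
    levels are nested from level 1 downwards (an assumed direction).
    """
    unions = [set().union(*level) for level in levels]
    nested = all(unions[i] >= unions[i + 1] for i in range(len(unions) - 1))
    return nested and all(is_strict_sperner(level) for level in levels)


path_like = [{1, 2}, {2, 3}]            # strict Sperner: 1 and 3 are characteristic
triangle = [{1, 2}, {2, 3}, {1, 3}]     # Sperner but not strict Sperner
two_levels = [[{1, 2}, {2, 3}], [{2}]]  # {2} sits below the union of level 1
```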

Consider the clone-probe hybridization matrix of a Lander-Waterman process.
Let $P$ be the set of probes, and let $C = \{C_1, \ldots, C_m\}$ be the clones
viewed as sets of probes. Then $H_{LW} = (P, C)$ is the associated hypergraph.
According to the Lander-Waterman model, the arrivals of the left endpoints of
the clones are distributed according to a Poisson process. If $|P|$ is large
enough, with high probability no clone is a subclone of any other clone. Then
$H_{LW}$ is a Sperner hypergraph. The average number of probes per clone is
$\lambda |P|$.

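Multiset Superstrings. A string $\sigma = \sigma_1 \cdots \sigma_r$ is a
multiset superstring of any subset of
$U(\sigma) = \{S : \exists\, 1 \le \beta \le \eta \le r : S = \{\sigma_\beta, \sigma_{\beta+1}, \ldots, \sigma_\eta\}\}$.

Set Superstrings. A string $\sigma$ is a set superstring (or simply,
superstring) of any subset of
$V(\sigma) = \{S : \forall\, \beta \le i < j \le \eta,\ \sigma_i \neq \sigma_j,\ S = \{\sigma_\beta, \ldots, \sigma_\eta\}\}$.

For $S \in U(\sigma)$ or $S \in V(\sigma)$ we define $\beta_\sigma(S)$,
$\eta_\sigma(S)$ so that
$S = \{\sigma_{\beta_\sigma(S)}, \ldots, \sigma_{\eta_\sigma(S)}\}$. We say that
$\sigma$ expresses $S$ if $S \in U(\sigma)$ ($S \in V(\sigma)$), also denoted
by $S \in \sigma$. A multiset (set) superstring $\sigma$ is non-repeating if no
letter in $\sigma$ occurs more than once.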
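
To make the difference between the two notions concrete, the sketch below tests whether a string expresses a set in the multiset sense (some window whose letters, with repetitions allowed, form exactly the set) or in the set sense (some window of pairwise distinct letters forming the set). Function names are hypothetical and the quadratic scan is chosen for clarity, not efficiency.

```python
def expresses_multiset(sigma, s):
    """True if some window sigma[b..e] has exactly the letters of s (repeats allowed)."""
    target = set(s)
    for b in range(len(sigma)):
        window = set()
        for e in range(b, len(sigma)):
            window.add(sigma[e])
            if window == target:
                return True
            if not window <= target:   # a foreign letter entered the window
                break
    return False


def expresses_set(sigma, s):
    """True if some window of pairwise distinct letters equals s as a set."""
    target = set(s)
    k = len(target)
    # A window of length k whose letter set has k elements has no repeats.
    return any(set(sigma[b:b + k]) == target for b in range(len(sigma) - k + 1))


def is_superstring(sigma, family, multiset=False):
    """True if sigma expresses every set of the family."""
    test = expresses_multiset if multiset else expresses_set
    return all(test(sigma, s) for s in family)


abc = {"a", "b", "c"}
only_multiset = expresses_multiset("abbc", abc) and not expresses_set("abbc", abc)
```

The string abbc expresses {a, b, c} only as a multiset superstring, because the window covering the set repeats the letter b.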

Now we are ready to define our main computational problems: 

The Hypergraph Set Superstring Problem: Given a hypergraph
$H = (X, \mathcal{S})$, find a superstring $\sigma = \sigma_1 \ldots \sigma_n$
for $H$ of minimal length $n$.

The Hypergraph Multiset Superstring Problem: Given a hypergraph
$H = (X, \mathcal{S})$, find a multiset superstring
$\sigma = \sigma_1 \ldots \sigma_n$ for $H$ of minimal length $n$.

Remark. The corresponding Graph Superstring Problem, where the hyperedges have
exactly two elements, can be solved in time linear in the number of edges in
the graph. The minimum superstring coincides with the Eulerian path if the
graph has such a path. In the general case, it coincides with the minimum-size
collection of Eulerian paths that cover all the edges. Our problem, the
Hypergraph Superstring Problem, is therefore a hypergraph generalization of
the Eulerian path problem in graphs.

The Sperner Decomposition of a Hypergraph Problem: Given a hypergraph
$H = (X, \mathcal{S})$ and an integer $k > 0$, decide whether there exists a
Sperner decomposition into $k$ levels.

3 Computational Complexity of the Hypergraph 
Superstring Problems 

We show that the hypergraph set superstring and the hypergraph multiset
superstring problems are MAXSNP-hard. We prove these results with an
L-reduction from TSP(1,2) on bounded degree undirected graphs. The same
reduction proves both problems to be MAXSNP-hard. We are thus strengthening
Kou’s result by showing that the same problem is MAXSNP-hard, which implies
that it is computationally intractable to approximate the optimum beyond some
constant multiplicative factor. We also show that computing a Sperner
Decomposition of a hypergraph is a hard computational task: it is NP-complete
to decide whether a two-level decomposition exists and, more generally, to find
the Sperner Decomposition with a minimal number of levels.

Theorem 1. The Hypergraph Set Superstring Problem and the Hypergraph Mul- 
tiset Superstring Problem are MAXSNP-hard even for 5-bounded strict Sperner 
hypergraphs. 

Proof. We use an L-reduction (intuitively, a linear reduction) from TSP(1,2) on
undirected graphs, on instances where the graph formed by length-one edges has
bounded degree. TSP(1,2) is the traveling salesman problem with distances 1 and
2. That is, given a complete graph $G$ with edges of distance 1 and 2, find the
shortest Hamiltonian path on the graph, i.e., the shortest path that visits
each node exactly once. This problem has been shown to be MAXSNP-complete even
if restricted to instances where the graph formed by the length-one edges has
bounded degree.

Let $H_G = (V, E)$ be a graph of bounded degree $D$ specifying an instance of
TSP(1,2). That is, $H_G$ contains the edges of cost 1 in the corresponding
TSP(1,2) graph $G$. For every $v \in V = \{1, \ldots, n\}$, with associated
edges $(v, u_1), \ldots, (v, u_d)$ where $d \le D$, define the hyperedge
$S_v = \{v, \{v, u_1\}, \ldots, \{v, u_d\}\}$. The hypergraph $H$ is then
$(X, \mathcal{S})$ where $X = V \cup E$ and $\mathcal{S} = \{S_v : v \in V\}$.
Clearly the above reduction can be performed in logarithmic space. Notice that
the resulting set collection is Sperner because every set $S_v$ has a
distinguishing element $v \in S_v$. Moreover, $\forall v : |S_v| \le D + 1$.

We will show that there is a Hamiltonian path on the graph $G$ of TSP(1,2) of
cost $n - 1 + k$ if and only if there is a (multiset, or set) superstring
$\sigma$ for $\mathcal{S}$ of length $m + k + 1$ where $m = |E|$. Since $H_G$
is a graph of degree bounded by $D$, $m \le D \times n$ is linear in $n$. This
will establish that the above reduction is an L-reduction.

Say there is a Hamiltonian path of cost $n - 1 + k$. Since all edges have costs
1 or 2, we know the path uses $n - 1 - k$ edges from $H_G$ and $k$ edges of
cost 2. Construct $\sigma$ of length $m + k + 1$ as follows: $\sigma$ arranges
the sets $S_v$ in the order the nodes $v$ are arranged on the path. Whenever an
edge $(u, v)$ in $H_G$ is used on the path, $S_u$ and $S_v$ overlap in one
element in $\sigma$. Then,

$$|\sigma| = \sum_{v=1}^{n} |S_v| - (n - 1 - k) = m + k + 1.$$

Conversely, say that $\sigma$ is a superstring of length
$m + k + 1 = \sum_{v=1}^{n} |S_v| - (n - k - 1)$. Construct a path by reading
in $\sigma$ each vertex in the order it appears. Since $\sigma$ is shorter than
$\sum_{v=1}^{n} |S_v|$ by $(n - 1 - k)$, there is a total overlap of
$(n - 1 - k)$ between the sets on the superstring. Since no two sets contain
more than one common element, there are $(n - 1 - k)$ pairs of sets that
overlap. Each such pair has a common edge. This establishes a total of
$(n - 1 - k)$ edges from $H_G$ used in the path, and hence a path of cost
$(n - 1 + k)$.
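
The forward direction of the reduction can be traced on a toy instance. The sketch below builds the hyperedges $S_v$ from the cost-1 edges and lays them out along a given Hamiltonian path, sharing one edge element between consecutive sets whenever the path uses a cost-1 edge. All names are hypothetical. To match the count $m + k + 1$, the code takes $m$ to be the number of (vertex, incident edge) pairs, i.e. the sum of the degrees; this reading of $|E|$ is an assumption, chosen because it gives $\sum_v |S_v| = n + m$.

```python
def hyperedges(n, edges):
    """S_v = {v} plus the cost-1 edges incident to v, for v = 1..n."""
    s = {v: {("v", v)} for v in range(1, n + 1)}
    for u, w in edges:
        e = ("e", min(u, w), max(u, w))
        s[u].add(e)
        s[w].add(e)
    return s


def superstring_from_path(path, s):
    """Concatenate the sets S_v in path order, overlapping on shared edges."""
    sigma, shared_prev = [], None
    for v, nxt in zip(path, path[1:] + [None]):
        shared_next = None
        if nxt is not None:
            common = s[v] & s[nxt]              # at most the edge {v, nxt}
            shared_next = next(iter(common), None)
        # Elements of S_v not shared with a neighbour on the path.
        block = sorted(x for x in s[v] if x not in (shared_prev, shared_next))
        sigma.extend(block)
        if shared_next is not None:
            sigma.append(shared_next)           # written once, used by both sets
        shared_prev = shared_next
    return sigma


n, edges, path = 4, [(1, 2), (2, 3), (1, 3)], [1, 2, 3, 4]
s = hyperedges(n, edges)
sigma = superstring_from_path(path, s)
k = sum(1 for a, b in zip(path, path[1:]) if not s[a] & s[b])  # cost-2 steps
m = sum(len(s[v]) - 1 for v in s)                              # sum of degrees
total = sum(len(s[v]) for v in s)
```

On this instance the path 1, 2, 3, 4 uses two cost-1 edges and one cost-2 edge, so the layout saves two letters relative to writing the four sets side by side.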



Theorem 2. The Sperner Decomposition of a Hypergraph Problem is NP-complete.
In particular, distinguishing between 2 and 3 levels for the minimum Sperner
decomposition of a hypergraph is NP-complete, even for 3-bounded hypergraphs
with hyperedge intersections of size $\le 1$.

Proof. (Sketch). Given a hypergraph $H = (X, \mathcal{S})$ and a partition of
$\mathcal{S}$ into $\mathcal{S}_1, \mathcal{S}_2$, we can efficiently check the
properties required of a Sperner decomposition. Therefore, the problem of
Sperner decomposition into $k$ levels is in NP. We will show NP-hardness by a
reduction from 3SAT.

Let $\phi = \psi_1 \wedge \cdots \wedge \psi_m$ be a 3-CNF formula, with
variables $x_1, \ldots, x_n$. We construct a hypergraph $\mathcal{S}_\phi$.
Figure 1 shows the main part of the construction.

Two or three boxes connected by a line network correspond to one hyperedge.
Any “o” contained in a box is a unique element in $X$. An “o” or “s” contained
in only one box is contained in only one set. Such a set has to be in layer 1,
because the union of layer 1 contains the union of layer 2. A set whose
elements all belong to sets in layer 1 has to be in a layer $\neq 1$.

Associate layer 1 with TRUE and layer 2 with FALSE. Then the top part of
Figure 1, containing the three sets labeled TRUE, TRUE, and FALSE, should be
self-explanatory. It follows that any two sets labeled $x$ and $\bar{x}$ in
Figure 1 are in different layers in any 2-layer Sperner decomposition.

Assign either all the $x$-sets or all the $\bar{x}$-sets to layer 1 for each
variable $x$, thereby constructing a truth assignment. Among the $x$-sets and
the $\bar{x}$-sets, notice in Figure 1 that there are some containing an
s-element. These sets are meant to correspond to literals in the clauses of
$\phi$.

For each variable $x$ with $k_x$ occurrences of literal $x$ and $k'_x$
occurrences of literal $\bar{x}$, construct $k_x$ $x$-sets with an s-element
and $k'_x$ $\bar{x}$-sets with an s-element. Finally, three s-elements collapse
to one if and only if the corresponding literals are in the same clause
$\psi_i$. Therefore there is one s-element for each clause.

Fig. 1. Gadget for truth assignment

Clearly a truth assignment satisfying every clause translates to a 2-level
Sperner decomposition. Conversely, a 2-level Sperner decomposition correctly
assigns truth values: for every variable $x$, all the $x$-sets are in the same
level, complementary to the one containing the $\bar{x}$-sets. Moreover, every
s-element belongs to three sets, one of which is in level 1, thereby
satisfying the corresponding clause.



4 Algorithms 

We designed a collection of algorithms that incrementally deal with more
complex hypergraph structures. They provide a collection of subroutines from
which the SPERNER-GREEDY-MERGE algorithm is constructed. The algorithm
SPERNER-GREEDY-MERGE retrieves the Consecutive Ones Property for a
hybridization matrix, which hints at the strength of the algorithm in dealing
with the different kinds of imperfections in physical mapping data. Moreover,
SPERNER-GREEDY-MERGE has approximation guarantees on sparse, biologically
relevant data. Complete details of the algorithms are included in the Appendix
sent to the Program Committee.

The Merge-Sequence-Pair procedure is the basic building block of the
algorithms. The algorithm merges pairs of already merged set collections. We
say that a sequence of sets $A = [A_1, \ldots, A_r]$ is a superstring
collection for a set collection $S = \{C_1, \ldots, C_s\}$ if for each $i$,
$1 \le i \le s$, there are $j_i, k_i$, $1 \le j_i \le k_i \le r$, such that
$C_i = \bigcup_{j_i \le l \le k_i} A_l$ and $A_l, A_m$ are disjoint for all
$j_i \le l < m \le k_i$. If $A$ and $B$ are superstring collections for clone
(set) collections $\{C_1, \ldots, C_{s_A}\}$ and $\{D_1, \ldots, D_{s_B}\}$,
then Merge-Sequence-Pair finds the optimal way of merging the two set
sequences $A$ and $B$ into $Merge(A, B)$, a superstring collection for
$\{C_1, \ldots, C_{s_A}, D_1, \ldots, D_{s_B}\}$. Merge-Sequence-Pair requires
that $\{C_1, \ldots, C_{s_A}, D_1, \ldots, D_{s_B}\}$ is Sperner, and respects
the order of the sets in set sequences $A$ and $B$. Merge-Sequence-Pair was
designed to provide a way to merge efficiently, in an incremental greedy way,
large collections of sets into one Q-node from which superstrings of the set
collections can be obtained.

The SPERNER-GREEDY-MERGE algorithm uses the Merge-Sequence-Pair algorithm in a
greedy way to construct superstrings. That is, all possible Merge-Sequence-Pair
operations are performed, each time performing the one that yields the greatest
overlap between the two structures that are merged. Each of the initial
structures (superstring collections) consists of one clone from the data set.
The SPERNER-GREEDY-MERGE algorithm assumes that the clone collection is
Sperner. At the first step of the algorithm all the clone intersection sizes
are computed, and among the clone pairs that provide maximum intersections, one
is chosen arbitrarily. This pair (call it C, D) is merged into a set sequence
consisting of three sets, $C \setminus D$, $C \cap D$, $D \setminus C$. At each
step, all new overlaps between the newly merged set sequence and the existing
ones are computed. The pair to be merged is chosen arbitrarily among the ones
with maximum overlap. The algorithm runs until there is no possible merge with
non-zero overlap. In the case that there is a non-repeating superstring for the
initial set of clones, SPERNER-GREEDY-MERGE retrieves the PQ-tree of all
possible non-repeating superstrings.
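
Only the very first greedy step is simple enough to sketch without the details of Merge-Sequence-Pair, which are not given here. The hypothetical function `first_merge` below picks a pair of clones with maximum intersection and returns the three-set sequence for that pair. The middle set is taken to be the intersection of C and D, an assumption that makes the three sets pairwise disjoint and lets both C and D appear as unions of consecutive sets.

```python
from itertools import combinations


def first_merge(clones):
    """First step of a greedy merge on a list of clones (sets of probes).

    Returns (i, j, [C - D, C & D, D - C]) for a pair with maximum
    intersection, or None if no two clones intersect. Ties are broken by
    taking the first maximal pair, one arbitrary choice among many.
    """
    if len(clones) < 2:
        return None
    i, j = max(
        combinations(range(len(clones)), 2),
        key=lambda ij: len(clones[ij[0]] & clones[ij[1]]),
    )
    c, d = clones[i], clones[j]
    if not c & d:
        return None
    return i, j, [c - d, c & d, d - c]


step = first_merge([{1, 2, 3}, {3, 4}, {2, 3, 5}])
```

In the returned sequence, C is the union of the first two sets and D is the union of the last two, so both clones occupy consecutive positions.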

GREEDY-MERGE deals with Sperner levels, accommodating inclusions from higher
levels of the Sperner decomposition. GREEDY-MERGE retrieves the C1P for
arbitrary hypergraphs. It is a generalization of the PQ-tree C1P algorithm; it
preserves the merges that are necessary for retrieving the consecutive ones
property, performing them in a greedy fashion according to maximum overlaps.
The GREEDY-MERGE algorithm uses the SPERNER-GREEDY-MERGE algorithm as a
subroutine.

The algorithm 2-PHASE-GREEDY is an approximation algorithm that works well on
strict Sperner hypergraphs. It achieves a 1.5625 worst-case ratio to the
optimal solution. This algorithm is based on the SPERNER-GREEDY-MERGE
algorithm, with some additional restrictions on the order in which the
Merge-Sequence-Pair operations are performed.

5 Experimental Results

We implemented the SPERNER-GREEDY-MERGE algorithm and ran it on randomly
generated data. The data were generated according to the Lander-Waterman
model, where clones are intervals of length 1 distributed uniformly along the
interval [0, N]. The interval [0, N] was divided into 1000N discrete positions
and probes were distributed along [0, N] according to a Poisson process, except
that for each clone C, a probe p was allowed to occur only once. Any
occurrences of p in C after the first were discarded. This distribution is very
similar to a pure Poisson distribution if, as in our case, the mean arrival
time of a probe is much greater than the length of a clone, which is 1 in our
case. The hypergraph that was given as input to SPERNER-GREEDY-MERGE consisted
of all the maximal generated clones.

Table 1 displays some results of running the algorithm while varying $N$, the
length of the interval where the clones are distributed; $n$, the number of
clones used for generating the data; $m$, the number of probes used for
generating the data; and $\lambda$, the parameter of the exponential
distribution of the arrival time of probes. $p$ is the average number of probes
after generating the data, $r_{avg}$ is the actual average number of
repetitions of probes, approximately $\lambda N$, and $r_{max}$ is the maximum
number of repetitions of a single probe, averaged over all generated sequences.
$L_0$ is the average length of the generated sequences, and $L_{GM}$ is the
average length of the sequences or sequence fragments produced by
SPERNER-GREEDY-MERGE. To facilitate presentation, the performance is given as a
percentage of optimal corresponding to the ratio $L_0 / L_{GM}$. That is, when
we say that the performance is 95.9%, as in the table below for the experiment
with $N = 20$ and 300 probes, we mean that SPERNER-GREEDY-MERGE produces on
average a superstring collection of total length 1.0428 × [length of the
initial sequence].
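
The conversion between the two ways of stating performance is a single division. As a small check of the figure quoted above, with a hypothetical helper name:

```python
def length_factor(performance_percent):
    """Length of the produced superstring relative to the initial sequence,
    given the performance L_0 / L_GM expressed as a percentage."""
    return 100.0 / performance_percent


factor = length_factor(95.9)   # about 1.0428, as stated in the text
```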



N    n    m    p      r_avg  r_max  L_0     L_GM    Performance
5    200  200  159.2  1.6    3.9    259.1   292.7   88.7%
10   100  200  118.3  1.4    3.8    165     163.2   100%
10   100  200  145    1.5    3.7    216.5   238.8   90.7%
10   100  200  159    1.7    4.7    268     319.5   84.2%
20   100  200  186.7  2.4    6.8    451.3   453.8   99.5%
20   100  200  192.8  3      7.1    555.3   585.5   94.9%
20   100  200  196.4  3.4    7.8    660.3   699     94.5%
20   100  300  275.5  2.4    6.5    638     665.5   95.9%
30   100  300  293    3.3    8.5    951     913     100%
30   150  300  293    3.3    8.5    969     1041    93.1%
40   200  400  397.5  4.9    12.5   1886.5  1937.5  97.4%

Table 1. Results on data generated according to the Lander-Waterman model.



Footnote. The clone beginnings are distributed along [0, N - 1] with uniform
probability.

As can be seen, the major factor that seems to hurt the performance of the
algorithm is the coverage of the gene, i.e., the average number of clones that
cover each point in the interval [0, N]. This indicates that a hypergraph that
is Sperner decomposable in a few layers is easier to handle than one that is
decomposable in many layers. This experimental observation is consistent with
our intuition that the Sperner Decomposition problem captures the essence of
the difficulty of computing minimal superstrings. High probe repetition also
hurts the performance of the algorithm, as expected. The performance of the
algorithm increases with the number of probes. Therefore the algorithm is
expected to produce good results provided that a sufficient number of probes is
used in the experiment. Finally, the performance seems unaffected as the number
of clones increases. Occasionally the algorithm produces a shorter superstring
than the initial superstring. This corresponds to experimental conditions where
either too few clones or too few probes are used, resulting in under-specified
instances of the problem.



6 Future Work 

Further research will focus on returning to the Lander-Waterman model to
relate the worst-case approximability of the algorithms to a probabilistic
analysis of their performance in the stochastic model. The mapping difficulties
introduced by repeated probes, as reported by the genomic centers for human
chromosomes, e.g., the Human Y Chromosome, seem well captured by the
combinatorial structure of our algorithms. We are planning a detailed
experimental analysis of the performance of our algorithms on real data.

On the theoretical side, it remains open whether a stronger inapproximability
result can be proved for MIN-HYPERGRAPH-SUPERSTRING, or whether a constant
approximation algorithm exists for the general problem.

Abstract. We infer post-hybridization rearrangements in a hybrid genome, given
the gene orders on its chromosomes and some knowledge of the two parent
genomes. We study this in two biologically and computationally different
contexts, genome fusion and interspecific fertilization. Exact algorithms are
furnished for some cases, and a heuristic based on the Hannenhalli-Pevzner
theory for another.



1 Introduction 

An important mechanism for the rapid emergence of a new, qualitatively dif- 
ferent species is the hybridization of two existing species. These parent species 
will generally be fairly closely related, but may have very different phenotypic 
expressions. There are actually several types of biological processes that give 
rise to hybrids, and these are perhaps most widespread in the plant kingdom. 
In this paper, we explore two such processes - genome fusion and interspecific 
fertilization. In the first case we give an exact, linear time algorithm for recon- 
structing the ancestral hybrid from knowledge of the modern genome and data 
about which gene came from which parent species. We then introduce additional 
data, on parental species gene order, and try to reconstruct two stages of hybrid 
genome evolution, intra- and intergenomic (referring to the haploid components 
originating from the two parents). We adapt the techniques of Hannenhalli and
Pevzner in a heuristic for separating these stages and give upper and lower
bounds for the optimal transition point between them.

In the case of interspecific fertility, we hypothesize that a key stage in the 
stabilization of the hybrid genome can be found by calculating the median of 
three diploid genomes, the two parents and the hybrid. We refer to a reduction
of this problem to the Traveling Salesman Problem.

Definitions 

A genome $G$ is a collection of $N$ chromosomes $G_1, \ldots, G_N$. A
chromosome is a string of signed (+ or -) elements from a set $\mathcal{E}$ of
genes. Each gene in $\mathcal{E}$ appears exactly once in the set of $N$
chromosomes.

For a string $X = x_1, \ldots, x_m$, we write $-X$ for the inverted string
$-x_m, \ldots, -x_1$. We define the following rearrangement operations as in
Figure 1. Inversion (or reversal), where any proper substring of a chromosome
is inverted. (Inverting the entire chromosome only invokes an alternate
notation for the identical chromosome, and does not constitute a rearrangement
operation.) Translocation, where two chromosomes (one or both inverted)
exchange prefixes of any length. A fusion is a translocation where one of the
prefixes is the entire chromosome and the other prefix is null. A fission is a
translocation where one of the starting chromosomes is the null string. Our
analyses of translocations implicitly include fusions and fissions.
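
The operations just defined can be written down on chromosomes represented as lists of signed integers, a representation chosen here only for illustration. The function names are hypothetical; fusion and fission appear as special cases of `translocate`, as in the definitions above.

```python
def flip(chrom):
    """The inverted string -X: reverse the order and negate every sign."""
    return [-g for g in reversed(chrom)]


def invert(chrom, i, j):
    """Inversion of the substring chrom[i:j] (reverse order, flip signs)."""
    return chrom[:i] + flip(chrom[i:j]) + chrom[j:]


def translocate(c1, c2, i, j):
    """Exchange the prefixes c1[:i] and c2[:j] of two chromosomes.

    Returned in the orientation of Figure 1: the first result keeps the
    prefix of c1, the second keeps the prefix of c2.
    """
    return c1[:i] + c2[j:], c2[:j] + c1[i:]


# Inversion: w x a b c y z  ->  w x -c -b -a y z  (genes numbered 1..7)
inverted = invert([1, 2, 3, 4, 5, 6, 7], 2, 5)
# Translocation: (a b c d, w x y z) -> (a b y z, w x c d)
swapped = translocate([1, 2, 3, 4], [5, 6, 7, 8], 2, 2)
# Fusion: one prefix is a whole chromosome, the other is null.
fused = translocate([1, 2], [3, 4], 2, 0)
# Fission: one of the starting chromosomes is the null string.
split = translocate([1, 2, 3], [], 1, 0)
```

A translocation involving an inverted chromosome can be obtained by applying `flip` to that chromosome first.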

[Figure 1. Translocation panel: chromosome 1 (a b c d) and chromosome 2
(w x y z) become chromosome 1' (a b y z) and chromosome 2' (w x c d).
Inversion panel: w x a b c y z becomes w x -c -b -a y z.]

Fig. 1. Schematic view of genome rearrangement processes. Letters represent
positions of genes. Vertical arrows at left indicate boundaries of affected
substrings. Translocation exchanges prefixes of two chromosomes. Inversion
reverses the order and sign of genes in a substring (dotted segment).



2 Resolution of Tetraploidy; Ancestral Synteny Unknown 

One form of hybridization of two karyotypically distinct species sees the fusion of
two genomes followed by a series of chromosomal rearrangement events until the
hybrid genome is finally stabilized as a diploid. The two homologous
versions of each gene, one from each parent species, may diverge functionally 
to create a gene family. From the moment of hybridization till the present, the 
two parent species may also undergo chromosomal rearrangement. Thus we have 
direct access to neither the ancestral hybrid genome nor the two contributing 
strains. In this section we provide a method for reconstructing the ancestral 
hybrid, given the order of the genes on its chromosomes as well as data (obtained, 
for example, from sequence analysis) on which of these genes originated from each 
of the parent species. 

2.1 Formalization 

Consider two genomes $A$ and $B$ having disjoint sets of genes,
$\mathcal{E}(A)$ and $\mathcal{E}(B)$, respectively. Let $G$ be a third genome
with $N$ chromosomes and gene set
$\mathcal{E} = \mathcal{E}(A) \cup \mathcal{E}(B)$. Given only
$\mathcal{E}(A)$, $\mathcal{E}(B)$, and $G$, including how the genes are
distributed and ordered on the $N$ chromosomes of $G$, the problem is to find
$d(G)$, the minimal number of inversions and translocations necessary to
transform $G$ into an ancestral hybrid genome $H$ (with any number of
chromosomes) satisfying the following condition: each chromosome of $H$
contains genes from $A$ only, or from $B$ only. See Figure 2.



[Figure 2. Time axis from the time of hybridization (genome H, gene sets
$\mathcal{E}(A)$ and $\mathcal{E}(B)$) to the present (genome G).]

Fig. 2. Evolution of a hybrid genome resulting from genome fusion when gene
origins, but not ancestral genome organization, is known. Genome H is to be
reconstructed from knowledge of genome G, and ancestral gene sets
$\mathcal{E}(A)$ and $\mathcal{E}(B)$ only.



2.2 Algorithm 

The following procedure solves this problem exactly in time linear in the
number of genes. The output attains the lower bound of the type found by
Watterson et al., except for certain special cases.

- In each chromosome $G_i$ of $G$, amalgamate each substring of consecutive
  A-origin genes to form an A-segment. Similarly form the B-segments.
  A-segments and B-segments alternate along the length of the chromosome,
  separated by breakpoints.

- Transform each chromosome with an odd number $b_i > 1$ of breakpoints to a
  chromosome consisting of a single A-segment and a single B-segment by means
  of inversions as follows.

  • While there remain at least 3 breakpoints, invert the fragment between
    the first and third breakpoints. Two A-segments are thus made adjacent
    and two B-segments are made adjacent.

  • Erase the breakpoints between the two adjacent A-segments and between
    the two adjacent B-segments, thus reducing the number of breakpoints by
    two.

- Transform each chromosome with an even number $b_i > 2$ of breakpoints to a
  chromosome consisting of either two A-segments and a single B-segment, or
  two B's and one A, by means of inversions as follows.

  • While there remain at least 4 breakpoints, invert the fragment between
    the first and third breakpoints. Two A-segments are thus made adjacent
    and two B-segments are made adjacent.

  • Erase the breakpoints between the two adjacent A-segments and between
    the two adjacent B-segments, thus reducing the number of breakpoints by
    two.

- Form as many pairs of ABA and BAB chromosomes as possible. Two
  translocations performed on each pair suffice to produce a homogeneous A
  chromosome and a homogeneous B chromosome, allowing the erasure of all four
  breakpoints.

- Suppose some 2-breakpoint chromosomes remain and they are all ABA. They may
  be amalgamated two by two, each time with a translocation that produces a
  homogeneous A chromosome and an ABA chromosome, and allows the erasure of
  two breakpoints, until only one ABA remains.

- Suppose instead of the previous step, the only 2-breakpoint chromosomes
  remaining are BAB. They may be amalgamated two by two, each time with a
  translocation that produces a homogeneous B chromosome and a BAB
  chromosome, and allows the erasure of two breakpoints, until only one BAB
  remains.

- If there are any 1-breakpoint chromosomes, form as many pairs of them as
  possible.

  • If there are no 2-breakpoint chromosomes, transform each of the pairs of
    one-breakpoint chromosomes into one homogeneous A chromosome and one
    homogeneous B by means of a single translocation, and erase the two
    breakpoints.

  • If there is a 2-breakpoint chromosome, transform all but one of the pairs
    of chromosomes into one homogeneous A chromosome and one homogeneous B
    by means of a single translocation, and erase the two breakpoints. Then
    two translocations suffice to transform the remaining pair and the
    2-breakpoint chromosome into three homogeneous chromosomes, and to erase
    all four remaining breakpoints.

- If 1- or 2-breakpoint chromosomes remain there are several cases:

  • If all that remains is a single 1-breakpoint chromosome, one
    translocation (fission) is required to produce two homogeneous
    chromosomes and remove the breakpoint.

  • If all that remains is a single 1-breakpoint chromosome and a single
    2-breakpoint one, two translocations (one a fission) are required to
    produce two homogeneous chromosomes and to remove all three breakpoints.

  • If all that remains is a single 2-breakpoint chromosome, two
    translocations (fissions) are required to produce two homogeneous
    chromosomes and to remove both breakpoints.



The output from this algorithm consists of homogeneous A chromosomes and 
homogeneous B chromosomes only, and the number of steps is
where the correction term is 0 except if the last step of the algorithm must be activated, i.e.,
when there are no chromosomes $G_i$ of forms $A \cdots B$ or $B \cdots A$, and an unequal
number of chromosomes of forms $A \cdots A$ and $B \cdots B$. In that case it equals 1.

Note that there are generally many equally good solutions to this problem. In 
the next section, we reformulate the problem in order to pin down the structure of 
the ancestral genome somewhat. This will require additional data on the parent 
genomes and some assumptions about the amount of evolution in the hybrid 
compared to the purebred descendants of the parents. 



3 Resolution of Tetraploidy; Ancestral Synteny and Gene 
Order Inferred 



A second version of the hybridization problem uses the modern configurations
A, B and G of the two parent genomes and the hybrid genome, respectively, to
infer the three ancestral genomes A', B' and G' at the moment of hybridization,
as on the left of Figure 3. Note that G' consists of the chromosomes in A' plus
the chromosomes in B'.



[Figure 3: H = A + B; $n_A + n_B$ steps (intragenomic only) lead to G'; $n_G$ steps (intra- and intergenomic) lead from G' to G.]

Fig. 3. Localization of ancestral hybrid immediately before intergenomic translocations



As a first step, we infer the total number n of evolutionary steps required
to produce G from a construct H consisting of the chromosomes of A and the
chromosomes of B, as on the right of Figure 3. We assume that G' is one of the
intermediate steps in this evolution so that $n = n_A + n_B + n_G$, where $n_X$ is the
number of steps from genome X' to genome X, for $X \in \{A, B, G\}$.

Under the assumption that one of the first translocations to occur in the
stabilization of the hybrid will be an intergenomic one, involving chromosomes
from both A' and B', we could locate G' at the last step on the path from H
to G before the first intergenomic translocation, as on the right of the figure.
Unfortunately, the optimal path is not unique, and there will generally be one
optimum whose first step is intergenomic, so that $n_A + n_B = 0$ and $n = n_G$.
This may indeed be biologically meaningful in some contexts where hybrids
evolve more rapidly than their parents. In other cases we may prefer to look for
the path where $n_G$ is as small as possible, to allow for a maximum of evolution
in the parent species.

It is this latter problem we investigate in this section. First we sketch the
method of Hannenhalli and Pevzner, hereafter “H-P”, for finding the minimum
number of translocations and inversions necessary to transform one genome
into another, and show how a heuristic for finding a minimal $n_G$ solution to the
hybridization problem may be grafted onto their algorithm. We then show how
to calculate, relatively quickly, an upper bound for this heuristic based on one
step of the algorithm. Finally, we construct a lower bound based on a breakpoint
argument.



3.1 The H-P Algorithm and a Heuristic for $n_G$

We will only sketch the H-P procedure, which is rather complex, and give ad- 
ditional details for those aspects which are modified in our heuristic. The first 
step in the comparison of two multi-chromosomal genomes through transloca- 
tions and inversions is to reduce it to a problem of comparing two single chro- 
mosome genomes through inversion only. These latter genomes are constructed 
essentially by concatenating the individual chromosomes in the original genomes 
end-to-end in an arbitrary order. (Additional dummy genes, called caps, must be
appropriately inserted at the ends of the original chromosomes of both genomes.)
Translocation in an original genome becomes inversion in the new one. In the
string representing a chromosome each gene $+x$ is replaced by the pair $x^t x^h$
and $-x$ by $x^h x^t$.

To find the minimum number of inversions $d(H, G)$ necessary to transform one
single-chromosome genome H to another, G, H-P constructs a cycle graph, a
bicolored graph $\mathcal{G}(V, E)$ with vertex set V containing $x^t$ and $x^h$ for all genes x in
H, where black edges connect neighboring vertices in H, and gray edges connect
neighboring vertices in G. Each vertex is thus adjacent to exactly one black
edge and one gray edge. $\mathcal{G}$ therefore has a unique decomposition into disjoint
alternating cycles. We set $b(\mathcal{G})$ to be the number of black edges of $\mathcal{G}$, and
$c(\mathcal{G})$ to be the number of cycles of $\mathcal{G}$. Note that $c(\mathcal{G})$ is maximal when G = H.
The size of cycle C is the number of black (or gray) edges in C. The inversion
distance between H and G is then:

$d(H, G) = b(\mathcal{G}) - c(\mathcal{G}) + h(\mathcal{G}) + f(\mathcal{G})$ (1)

where $h(\mathcal{G})$ is the number of hurdles in $\mathcal{G}$, and $f(\mathcal{G}) = 1$ if $\mathcal{G}$ is a fortress and
$f(\mathcal{G}) = 0$ if not. (These concepts will be discussed below.)
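The first two terms of the distance formula can be illustrated with a minimal Python sketch. It assumes a single-chromosome genome written as a signed permutation of 1..n that is compared with the identity, and it computes only the number of black edges and the number of alternating cycles; hurdles and fortresses are not detected, so the difference b - c is only a lower bound on the inversion distance. The function name `cycle_graph_counts` and the integer encoding of gene ends are illustrative choices, not part of the H-P description.

```python
def cycle_graph_counts(signed_perm):
    """Return (b, c): the number of black edges and of alternating cycles
    for a signed permutation of 1..n compared with the identity."""
    n = len(signed_perm)
    # Encode gene +x as the pair (2x-1, 2x) and -x as (2x, 2x-1),
    # then frame the sequence with 0 and 2n+1.
    ext = [0]
    for x in signed_perm:
        tail, head = 2 * abs(x) - 1, 2 * abs(x)
        ext.extend((tail, head) if x > 0 else (head, tail))
    ext.append(2 * n + 1)

    # Black edges join ends of consecutive genes in the given genome.
    black = {}
    for k in range(0, len(ext), 2):
        u, v = ext[k], ext[k + 1]
        black[u], black[v] = v, u
    # Gray edges join ends of consecutive genes in the identity genome.
    gray = {}
    for i in range(n + 1):
        gray[2 * i], gray[2 * i + 1] = 2 * i + 1, 2 * i

    # Follow black and gray edges alternately to count cycles.
    seen, cycles = set(), 0
    for start in ext:
        if start in seen:
            continue
        cycles += 1
        v = start
        while v not in seen:
            seen.add(v)
            w = black[v]
            seen.add(w)
            v = gray[w]
    return n + 1, cycles


b, c = cycle_graph_counts([-1])
print(b - c)  # lower bound on the number of inversions
```

For the identity permutation every cycle has size one, so b equals c and the bound is zero, which matches the remark that the cycle count is maximal when the two genomes coincide.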

A key concept in the algorithm is the oriented component. A gray edge in
a cycle is oriented if the inversion disrupting the two adjacent black edges, i.e.,

a adjacent to b in H, b adjacent to c in G, c adjacent to d in H

becomes

a adjacent to c in H, c adjacent to b in G, b adjacent to d in H,

replaces the cycle by two cycles. An oriented cycle is one containing at least one
oriented gray edge. Two cycles containing gray edges that “cross”, e.g.,
gene i adjacent to gene j in Cycle 1, gene k adjacent to gene t in Cycle 2 in G,
but ordered i, k, j, t in H, are connected. A component of $\mathcal{G}(V, E)$ is a subset of
the cycles, built recursively from one cycle, at each step adding all the remaining
cycles connected to any of those already in the construction.

An oriented component has at least one oriented cycle. Hurdles are a particular
class of unoriented components. The entire graph $\mathcal{G}(V, E)$ is a fortress if a
certain configuration of hurdles obtains.

The H-P algorithm proceeds by decreasing $h - c$, the number of hurdles
minus the number of cycles, at each step. It handles each oriented component
independently. If component C has $\gamma_C$ black edges and $\kappa_C$ cycles, the algorithm
proceeds to find a series of $\gamma_C - \kappa_C$ inversions that reduces the component to a
set of $\gamma_C$ 1-cycles.

Hurdles are treated somewhat differently. There is no inversion which will
immediately increase the number of cycles in such a component. Instead, certain
hurdles undergo an inversion which changes them into oriented components,
decreasing the number of hurdles by one and leaving the number of cycles
unchanged (hurdle “cutting”). Other pairs of hurdles are merged by means of an
inversion that decreases the cycle count by one, but also decreases the number
of hurdles by two.

In each case, after the first inversion in a hurdle or pair of hurdles, the 
resulting configuration is an oriented component which may be reduced as above. 

Unoriented components which are not hurdles will eventually become ori- 
ented through inversions operating on other components, and may then be re- 
duced accordingly. 

Thus the execution of the H-P algorithm involves repeatedly choosing an 
oriented cycle and performing an inversion around an oriented gray edge, thus 
increasing the number of cycles by one, except for the first inversion whenever 
hurdles must be cut or merged. The strategy for our heuristic focuses on the 
successive choices of cycles and edges within cycles. The idea is to stop the 
reduction of an oriented component when there is no choice of cycle and edge 
within the cycle which corresponds to an intragenomic translocation or inversion 
(i.e. involving genes from A only or genes from B only). Similarly, if either the 
conversion of a hurdle to an oriented component, or the pairing of two hurdles, 
corresponds to an intergenomic transfer, it is postponed.

This procedure is validated by the fact that each oriented component may be
reduced independently of whatever inversions apply outside the component.
Eventually, when no more intragenomic translocations are possible, we have reached
a locally maximum value of $n_A + n_B$ (local minimum for $n_G$); the postponed
reductions are then re-started and the algorithm proceeds to an optimum solution.

3.2 An Upper Bound for the Heuristic 

Suppose the decomposition of $\mathcal{G}(V, E)$ contains monogenomic oriented components
$C_1, \ldots, C_r$ (each involving genes from a single genome only, A or B).
The decomposition may also contain other components. If component $C_i$ has
$\gamma_{C_i}$ black edges and $\kappa_{C_i}$ cycles, the r components will be reduced by
$\gamma = \sum_{i=1}^{r} (\gamma_{C_i} - \kappa_{C_i})$ inversions. Then

$d(H, G) - \gamma \ge n_G$,

where $n_G$ is the value found by the heuristic.

This bound can be improved in three stages: 

— By including, in the calculation of y, at least one monogenomic oriented 
cycle (if one exists) contained in each bi-genomic oriented component. 

— By including, for each bi-genomic oriented component not satisfying the pre- 
vious criterion, an intragenomic inversion (if one exists) around an oriented 
gray edge in a bi-genomic oriented cycle. 

— By repeating the above steps on certain hurdles whose treatment does not 
depend on the previous analysis of other hurdles. 



3.3 A Lower Bound for $n_G$

Label the genes in G according to whether they come from A or B, as in Section
2, and form segments of contiguous A’s and B’s. Suppose there are b breakpoints
in all. Then at least $\lceil b/2 \rceil$ translocations and inversions are required to remove
these breakpoints, and these are necessarily intergenomic. I.e., $\lceil b/2 \rceil \le n_G$.
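The breakpoint count behind this bound is easy to compute. The following Python sketch assumes each chromosome of G is given as a string over the letters A and B, one letter per gene in chromosomal order; this encoding and the function name `breakpoint_lower_bound` are illustrative assumptions. A breakpoint is counted wherever two adjacent genes carry different labels, and since a single operation removes at most two breakpoints, half the count (rounded up) bounds the number of intergenomic operations from below.

```python
def breakpoint_lower_bound(chromosomes):
    """chromosomes: list of strings over {'A', 'B'}, one label per gene.
    Returns (b, bound), where b is the number of breakpoints and
    bound = ceil(b / 2) is the lower bound on intergenomic operations."""
    b = 0
    for chrom in chromosomes:
        # A breakpoint separates two adjacent genes of different origin.
        b += sum(1 for x, y in zip(chrom, chrom[1:]) if x != y)
    return b, (b + 1) // 2


print(breakpoint_lower_bound(["AABBA", "BBB", "AB"]))
```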

4 Hybridization through Interspecific Fertility 

Hybrids may be formed by the fertilization event of two distinct though related 
species, an accident in nature but often feasible in the laboratory.
The parent species A and B may differ from each other by numerous genome 
rearrangements. The hybrid G' is able to survive and propagate despite the 
difference between the two haploid components of its diploid genome. Genome 
rearrangement of the hybrid rapidly ensues, however, first until a normal sym- 
metric diploid configuration G* is attained, and then while further stabilization 
of the new genome occurs. This scenario is illustrated in Figure 4. The rapid
evolution of the hybrid means that we can often assume the relative stability of 
the parent genomes A and B if the evolutionary scale is not too lengthy. 

Suppose that the rearrangements of the hybrid between G' and G* are in- 
tragenomic. I.e., the two hybrid genomes “conspire” to reorganize internally to 
a common form, before fixing any intergenomic translocations. Then the infer- 
ence problem which arises is to find G * , and the amount of rearrangement which 
occurred between G' and G* , and between G* and the modern genome G. 

This is essentially the “median problem” for genomes: Given three genomes
A, B and G, find the “median” G* which minimizes the sum of a genomic distance
between G* and A, G* and B, and G* and G. In general, this is a difficult
problem, but in one case, namely when the distance is just the sum of the breakpoints
between G* and each of the other three genomes, an algorithm is available
based on a reduction of the median problem to the Traveling Salesman Problem,
which functions well for genomes containing a fairly large number of genes.

Fig. 4. Rearrangement before and after development of symmetric diploid.

In the case where the rearrangements between G' and G* are intergenomic 
from the start, it is difficult to propose a general model; unrestricted rearrange- 
ments in this context allow, for example, two versions of the same chromoso- 
mal segment in one haploid component, and zero in the other, meaning that 
the models of meiosis and mitosis underlying genome rearrangement theory no 
longer apply. 

In the context of hybridization by interspecific fertilization, an additional 
type of data may be available. Genome typing informs us which chromosomal 
segments originate in which parental species (cf 0). This pattern derives from 
the normal recombination events in the production of gametes. It may or may 
not be the case that the genomic position of a segment is correlated with that of 
its homologue in the parental species from which it derives. This illustrates the
difference between this mechanism of hybridization and that of Sections 2 and
3, where genome fusion permits the retention of both of a pair of homologous
genes, one from each parent.

Discussion 

At least four aspects of the study of hybridization and rearrangement play a role 
in determining the nature of inference problem involved: 

- The biological mechanism - genome fusion, interspecific fertilization, or other. 

- The kinds of data available - present-day genomes, “ancestral” (i.e. stable 
or slowly-evolving) genomes, identification of genes in parental species (fusion 
model), or of segments (fertilization model). 

- Assumptions about relative rates of evolution and about types of rearrange- 
ment event permitted. 

- The entities to be inferred - events, syntenies, gene orders, beginning of in- 
tergenomic translocations. 

The more detailed the kinds of data, the more detailed the kind of reconstruction
that is possible, and the less ambiguity (non-uniqueness) in the results.
For example, the analysis in Section 2 generally reconstructs a large number of
optimal solutions, while the one in Section 3 will be less ambiguous.

Each type of problem may require different tools from the inventory of meth- 
ods developed in recent years for the study of genome rearrangement. 

The most obvious domain of application of these methods is in the plant 
kingdom. The genomes of the cereals are particularly well-mapped and some of 
these show evidence of hybridization of the genome fusion type. The work of 
Rieseberg illustrates the possibilities of the analysis in Section 4. As more
genomic data becomes available, our methods should be more widely applicable.

Abstract. In sequencing by hybridization (SBH), one has to reconstruct a sequence
from its k-long substrings. SBH was proposed as a promising alternative
to gel-based DNA sequencing approaches, but in its original form the method is
not competitive. Positional SBH is a recently proposed enhancement of SBH in
which one has additional information about the possible positions of each substring
along the target sequence. We give a linear time algorithm for solving the
positional SBH problem when each substring has at most two possible positions.
On the other hand, we prove that the problem is NP-complete if each substring
has at most three possible positions.



1 Introduction 

Sequencing by hybridization (SBH) was proposed and patented in the late eighties as an
alternative approach to gel-based sequencing [4, 8, 15, 2, 9]. Using DNA chips, cf. [16],
one can in principle determine exactly which k-mers (k-tuples) appear as substrings in
a target that needs sequencing, and try to infer its sequence. Practical values of k are 8
to 10.

The fundamental computational problem in SBH is the reconstruction of a sequence
from its spectrum - the list of all k-mers that are included in the sequence (along with
their multiplicities). A naive approach to the problem is to look for a Hamiltonian path
in a directed graph whose vertices correspond to k-mers in the spectrum, and two vertices
are connected if the (k-1)-suffix of one equals the (k-1)-prefix of the other.
This is however a computationally hard problem. Pevzner [10] has shown that the
reconstruction problem can be reduced to finding an Eulerian path in another directed
graph, an easily solvable problem. In that graph, vertices correspond to (k-1)-tuples,
and for each k-tuple in the spectrum, an edge connects the vertices corresponding to its
(k-1)-long prefix and suffix.
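The Eulerian-path reduction can be sketched in a few lines of Python. The sketch below assumes the spectrum is a list of equal-length strings with k at least 2, builds the graph on (k-1)-tuples, and extracts one Eulerian path with Hierholzer's method (a standard technique chosen here for illustration, not one prescribed by the text). The function name `reconstruct` is hypothetical. When several Eulerian paths exist, the returned sequence is just one of the strings sharing the given spectrum.

```python
from collections import Counter, defaultdict


def reconstruct(spectrum):
    """Return one string whose k-spectrum equals `spectrum`, or None."""
    if not spectrum:
        return None
    out_edges = defaultdict(list)
    balance = Counter()  # out-degree minus in-degree of each (k-1)-tuple
    for kmer in spectrum:
        u, v = kmer[:-1], kmer[1:]
        out_edges[u].append(v)
        balance[u] += 1
        balance[v] -= 1
    starts = [v for v, d in balance.items() if d == 1]
    ends = [v for v, d in balance.items() if d == -1]
    if any(abs(d) > 1 for d in balance.values()) or len(starts) > 1 or len(ends) > 1:
        return None  # degree conditions for an Eulerian path fail
    start = starts[0] if starts else spectrum[0][:-1]

    # Hierholzer's method: walk until stuck, then back up and splice.
    stack, path = [start], []
    while stack:
        v = stack[-1]
        if out_edges[v]:
            stack.append(out_edges[v].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    if len(path) != len(spectrum) + 1:
        return None  # some edges were unreachable from the start vertex
    return path[0] + "".join(v[-1] for v in path[1:])


target = "ATGGCGTGCA"
spectrum = [target[i:i + 3] for i in range(len(target) - 2)]
print(reconstruct(spectrum))
```

For this example the graph contains branches, so the printed string need not equal the target even though both have the same 3-spectrum.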

The main handicap of SBH is ambiguity of the solution. Alternative solutions are
manifested as branches in the graph (i.e., two or more edges leaving the same vertex),
and unless the number of branches is very small, there is no good way to determine
the correct sequence. Theoretical analysis and simulations [17, 11] have shown that the
average length of a uniquely reconstructible sequence using an 8-mer chip is only about
two hundred, way below a single read length on a commercial gel-lane machine.

Due to the centrality of the sequencing problem in biotechnology and in the Human 
Genome Project, and due to its mathematical elegance, SBH continues to draw a lot of 
attention. Many authors have suggested ways to improve the basic method. Alternative 
chip designs as well as interactive protocols were suggested. An
effective and competitive sequencing solution using SBH has yet to be demonstrated. 

Recently, several authors have suggested enhancements of SBH based on adding
location information to the spectrum. In positional sequencing by hybridization
(PSBH), additional information is gathered concerning the position of the k-mers in the
target sequence. More precisely, for each k-mer in the spectrum its allowed positions
along the target are registered. The reduction to the Eulerian path problem still applies,
but for each edge in Pevzner’s graph we now have constraints restricting its position
in the Eulerian path. Mathematically, this gives rise to the positional Eulerian path
problem (PEP): Given a directed graph with a list of allowed positions on each edge,
decide if there exists an Eulerian path in which each edge appears in one of its allowed
positions. Hannenhalli et al. showed that PEP is NP-complete, even if all the lists of
allowed positions are intervals of equal length. Note that this leaves open the complexity
of PSBH. They also gave a polynomial algorithm for the problem when the length of
the intervals is bounded.

In this paper we address the positional sequencing by hybridization problem in the
case that the number of allowed positions per k-mer is bounded, and the positions need
not be consecutive. We give a linear time algorithm for solving the positional Eulerian
path problem, and hence, the PSBH problem, in the case that each edge is allowed at
most two positions. On the negative side, we show that the problem of PSBH is
NP-complete, even if each k-mer has at most three allowed positions and multiplicity one.
We use in our hardness proof a reduction from the positional Eulerian path problem
restricted to the case where each edge is allowed at most three positions. The latter
problem is shown to be NP-complete as well.

The paper is organized as follows: In Section 2 we define the PSBH and the PEP
problems. In Section 3 we describe a linear time algorithm for the PEP problem when
each edge has at most two allowed positions. In Section 4 we prove that the PEP problem
is NP-complete if each edge has at most three allowed positions. Finally, we show
in Section 5 that the PSBH problem is NP-complete when each k-mer is allowed at
most three positions. For lack of space, some proofs are omitted.



2 Preliminaries 

All graphs in this paper are simple, finite, and directed. Let $D = (V, E)$ be a graph.
We denote $m = |E|$ throughout. For a vertex $v \in V$, we define its in-neighbors to be
the set of all vertices from which there is an edge directed into v. We denote this
set by $N_{in}(v) = \{u : (u, v) \in E\}$. We define the in-degree of v to be $|N_{in}(v)|$. The
out-neighbors $N_{out}(v)$ and out-degree are similarly defined.

Let $E = \{e_1, \ldots, e_m\}$ and let P be a function mapping each edge of D to a non-empty
set of integer labels from $\{1, \ldots, m\}$ (its allowed positions). We call such a pair
$(D, P)$ a positional graph. If for all e, $|P(e)| \le k$, then $(D, P)$ is called a k-positional
graph. Let $\pi = \pi(1), \ldots, \pi(m)$ be a permutation of the edges in E. If $\pi$ defines a path,
i.e., for each $1 \le i < m$, $\pi(i) = (u, v)$ and $\pi(i + 1) = (v, w)$, for some $u, v, w \in V$,
then we say that $\pi$ is an Eulerian path in D.

An Eulerian path $\pi$ in D is said to be compliant with the positional graph $(D, P)$ if
$\pi^{-1}(e) \in P(e)$ for each $e \in E$, that is, each edge in $\pi$ occupies an allowed position.
The k-positional Eulerian path problem is defined as follows:

Problem 1 (k-PEP). Instance: A k-positional graph $(D, P)$.

Question: Is there an Eulerian path compliant with $(D, P)$?
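Verifying a proposed solution to this problem is straightforward, which is what places it in NP. The Python sketch below checks the three conditions of the definition: the proposed order is a permutation of the edge set, consecutive edges share a vertex, and every edge sits at an allowed position. Edges are assumed to be (tail, head) tuples in a simple graph and positions are numbered from 1; the function name and argument layout are illustrative.

```python
def is_compliant_eulerian_path(edges, allowed, order):
    """edges: list of (u, v) pairs of a simple directed graph.
    allowed: dict mapping each edge to its set of allowed positions in 1..m.
    order: list of edges; order[i - 1] is the edge placed at position i."""
    m = len(edges)
    if len(set(order)) != m or sorted(order) != sorted(edges):
        return False  # not a permutation of the edge set
    for i in range(m - 1):
        if order[i][1] != order[i + 1][0]:
            return False  # consecutive edges do not form a path
    # Every edge must occupy one of its allowed positions.
    return all(pos in allowed[e] for pos, e in enumerate(order, start=1))


edges = [("a", "b"), ("b", "c"), ("c", "a")]
allowed = {("a", "b"): {1, 3}, ("b", "c"): {2}, ("c", "a"): {1, 3}}
print(is_compliant_eulerian_path(edges, allowed, edges))
```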

Let $\Sigma = \{A, C, G, T\}$. The p-spectrum of a string $X \in \Sigma^*$ is the multi-set of
all p-long substrings of X. The problem of sequencing by hybridization is defined as
follows:

Problem 2 (SBH). Instance: A multi-set S of p-long strings. 

Question: Is S the p-spectrum of some string X ? 

For simplicity, we shall call the input multi-set a spectrum, even if it does not correspond
to a sequence. The SBH problem is solvable in polynomial time by a reduction
to finding an Eulerian path in Pevzner’s graph [10]. More specifically, construct a graph
D whose vertices correspond to (p-1)-long substrings of strings in S, and edges are
directed from $\sigma_1 \ldots \sigma_{p-1}$ to $\sigma_2 \ldots \sigma_p$ for each $\sigma_1 \ldots \sigma_p \in S$. Then every solution to
the SBH instance naturally corresponds to an Eulerian path in D.

The positional SBH problem is defined as follows:

Problem 3 (PSBH). Instance: A multi-set S of p-long strings. For each $s \in S$, a set
$P(s) \subseteq \{0, \ldots, |S| - 1\}$.

Question: Is S the p-spectrum of some string X, such that for each $s \in S$ its position
along X is in $P(s)$?

If the set of allowed positions for each string is of size at most k, then the corresponding
problem is called k-positional SBH, or k-PSBH. k-PSBH is linearly reducible
to k-PEP in an obvious manner.

3 A Linear Algorithm for 2-Positional Eulerian Path 

In this section we provide a linear time algorithm for solving the 2-positional Eulerian
path problem. A key element in our algorithm is reducing the problem to 2-SAT. To this
end, the input is preprocessed, discarding unrealizable edge labels (positions).

Let $(D = (V, E), P)$ be the input 2-positional graph. For every $1 \le t \le m$ define
$A(t)$ to be the set of edges allowed at position t, $A(t) = \{e \in E : t \in P(e)\}$. For
every vertex $v \in V$ define $In(v, t)$ as the set of t-labeled edges entering v,
$In(v, t) = \{(u, v) : (u, v) \in A(t)\}$. Similarly define $Out(v, t) = \{(v, u) : (v, u) \in A(t)\}$.

The first phase of the algorithm applies the following preprocessing step: 

while $\exists t$ such that $A(t) = \{e\}$ ($A(t)$ is a singleton), do:

Suppose $P(e) = \{t, t'\}$ with $t' \ne t$.

Set $A(t') \leftarrow A(t') \setminus \{e\}$.

Set $P(e) \leftarrow \{t\}$.

Lemma 1. The preprocessing step does not change the set of Eulerian paths compliant
with $(D, P)$.

When implementing this step, we maintain current $P(e)$ for each e, and $A(t)$ for
each t. If at any stage we discover that some set $A(t)$ is empty, then we output False and
halt, since no edge can be labeled t. The preprocessing phase can be implemented in
linear time. We omit further details. In the following we denote by $(D, P)$ the positional
graph obtained after the preprocessing phase. The notation A refers to the resulting
instance as well.

Lemma 2. In (D, P) each position is allowed for at most two edges. 

Proof. The preprocessing ensures that if for some position t, $|A(t)| = 1$, then $e \in A(t)$
satisfies $|P(e)| = 1$. Let R be the set of positions t with $|A(t)| = 1$, and let $r = |R|$.
Then there are $m - r$ positions t for which $|A(t)| \ge 2$, and $r' \ge r$ edges e with
$|P(e)| = 1$. Thus,

$2(m - r) \le \sum_{t \notin R} |A(t)| = \sum_{e \in E} |P(e)| - r = 2m - r' - r \le 2(m - r)$.

Hence, $r = r'$ and each label $t \notin R$ occurs exactly twice, implying that $|A(t)| \in \{1, 2\}$
for all t. ∎

We say that vertex v is fixed to position t in $(D, P)$ if $In(v, t) = A(t)$ or
$Out(v, t + 1) = A(t + 1)$. That is, any Eulerian path compliant with $(D, P)$ must visit v at position
t. Define Boolean variables $X_e^t$ for all $e \in E$, $t \in P(e)$ ($\sum_e |P(e)| = 2m - r$ variables in total).

Define now the following sets of Boolean clauses:

$X_e^t$ for every $e \in E$ where $P(e) = \{t\}$. (1)

$X_{e_1}^t \oplus X_{e_2}^t$ for every $t \notin R$ where $A(t) = \{e_1, e_2\}$. (2)

$X_e^{t_1} \oplus X_e^{t_2}$ for every $e \in E$ where $P(e) = \{t_1, t_2\}$. (3)

$X_{(a,b)}^t \Leftrightarrow X_{(b,c)}^{t+1}$ for every $t \in P((a, b))$, $(t + 1) \in P((b, c))$, b is not fixed to t. (4)

$\overline{X_{(u,v)}^t}$ for every $t \in P((u, v))$, $t < m$, s.t. $Out(v, t + 1) = \emptyset$. (5)

$\overline{X_{(u,v)}^t}$ for every $t \in P((u, v))$, $t > 1$, s.t. $In(u, t - 1) = \emptyset$. (6)



Lemma 3. There is a positional Eulerian path compliant with $(D, P)$ iff the set of
clauses (1)-(6) is satisfiable.

Proof. Suppose that a satisfying truth assignment $\Phi$ exists. We shall assign an edge e
to position t iff $\Phi(X_e^t) = $ True. Clauses (1) and (2) guarantee that exactly one edge is
assigned to each position. Clauses (1) and (3) guarantee that each edge is assigned to
exactly one position, and that this position is allowed to the edge.

It remains to show that the above assignment of edges to positions yields a path in
D. Suppose to the contrary that both $X_{(a,b)}^t$ and $X_{(b',c)}^{t+1}$ are assigned True, with $b \ne b'$.
Then clauses (5) guarantee the existence of an edge $(b, c') \in A(t + 1)$, while clauses (6)
guarantee the existence of an edge $(a', b') \in A(t)$. Therefore, b is not fixed to t, and
a contradiction follows from clauses (4). Hence, $\Phi$ defines an Eulerian path compliant
with $(D, P)$.

The converse can be shown in a similar way. ∎

Theorem 1. 2-PEP is solvable in linear time. 

Proof. The preprocessing phase is linear. By Lemma 2 the number of clauses (1)-(6) is
$O(m)$. Each XOR clause in (2)-(3) and each equivalence clause in (4) can be written as
two OR clauses. Moreover, one can generate all clauses in linear time. By Lemma 3 the
problem is reduced to an instance of 2-SAT which is solvable in linear time. ∎
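The rewriting of XOR and equivalence constraints as pairs of two-literal OR clauses, on which the reduction to 2-SAT relies, can be checked mechanically. In the Python sketch below a literal is a (name, polarity) pair and a clause is a pair of literals; these representations and the helper names are illustrative assumptions. The final loop confirms both identities over all four truth assignments.

```python
from itertools import product


def xor_as_or(x, y):
    """x XOR y  ==  (x OR y) AND (NOT x OR NOT y)."""
    return [((x, True), (y, True)), ((x, False), (y, False))]


def iff_as_or(x, y):
    """x <=> y  ==  (NOT x OR y) AND (x OR NOT y)."""
    return [((x, False), (y, True)), ((x, True), (y, False))]


def satisfied(clauses, assignment):
    """True if every clause contains a literal made true by the assignment."""
    return all(
        any(assignment[name] == polarity for name, polarity in clause)
        for clause in clauses
    )


for a, b in product([False, True], repeat=2):
    values = {"x": a, "y": b}
    assert satisfied(xor_as_or("x", "y"), values) == (a != b)
    assert satisfied(iff_as_or("x", "y"), values) == (a == b)
print("both rewritings agree with their truth tables")
```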

Corollary 1. 2-PSBH is solvable in linear time. 

4 3-Positional Eulerian Path Is NP-Complete 

In this section we prove that the 3-PEP problem is NP-complete by reduction from 
3-SAT. 

Theorem 2. The 3-PEP problem is NP-complete.

Proof. Membership in NP is trivial. We prove NP-hardness by reduction from 3-SAT.
Let F be a 3-CNF formula with N variables $x_1, \ldots, x_N$, and M clauses $C_1, \ldots, C_M$.
We assume, w.l.o.g., that each clause contains three distinct variables, and that all 2N
literals occur in F. Denote $X_i = \{x_i\} \cup \{\bar{x}_i\}$. For a literal $L \in X_i$, let $o_L$ denote
the number of its occurrences in F. For $1 \le j \le o_L$ define $L(j) = (L, j)$, thus
$L(1), \ldots, L(o_L)$ is an enumeration of indices to the occurrences of L in F. For a clause
$C = L \vee L' \vee L''$ introducing the j-th (j'-th, j''-th) occurrence of L (L', L'', respectively), we
write $C = L(j) \vee L'(j') \vee L''(j'')$. We shall construct a directed graph $D = (V, E)$ and
a map P from E to integer sets of size at most 3, such that F is satisfiable iff $(D, P)$
has a compliant Eulerian path.

4.1 Outline of the Construction 

We now provide a sketch of the main parts of the construction. For each occurrence of a 
variable in the formula, a special vertex is introduced. Special vertices corresponding to 
the same literal form a literal path. Two literal paths of a variable and its negation are 
connected in parallel to form a variable subgraph. For each clause in the formula, the 
corresponding special vertices are connected by three edges to form a clause triangle. 
Finally, for each special vertex we introduce a triangle incident on it, called its bypass 
triangle (see Figure 1).

The sets of allowed positions are chosen so that they force every compliant Eulerian 
path to visit the literal paths one by one. A compliant Eulerian path corresponds to a 
satisfying truth assignment. When a special vertex is visited, either its clause triangle, 
or its bypass triangle are traversed. Traversing the clause triangle while passing through 
a certain literal’s path corresponds to this literal satisfying the clause. We make sure
that for one of $x_i$ and $\bar{x}_i$, no clause triangle is visited while passing through its literal
path. Eventually, we enable visiting all unvisited bypass triangles.

Fig. 1. A schematic sketch of the main elements in our construction. The figure includes
three variable subgraphs, with the first variable (whose subgraph is rightmost) having
three positive occurrences and two negated occurrences, etc. One of the clause triangles
is also drawn, using dashed lines.



4.2 Construction in Detail 

We introduce the following vertices: 

- $u_i, w_i$ for each variable $x_i$, $1 \le i \le N$.

- $v_{L(j)}, v'_{L(j)}$ for each occurrence $L(j)$ of the literal L. $v_{L(j)}$ is called special. For
$L \in X_i$, we shall denote $u_i$ also by $v_{L(0)}$, and $w_i$ also by $v_{L(o_L + 1)}$.

- $r(C_c)$ for each clause $C_c$, $1 \le c \le M$, identifying $w_N$ as $r(C_0)$ and $u_1$ as
$r(C_{M+1})$.

We introduce the following edges:

- For each clause $C = L(j) \vee L'(j') \vee L''(j'')$, a clause triangle with the edges
$\{(v_{L(j)}, v_{L'(j')}), (v_{L'(j')}, v_{L''(j'')}), (v_{L''(j'')}, v_{L(j)})\}$.

- For each occurrence $L(j)$ of the literal L in the clause C, a bypass triangle with the
edges $\{(v_{L(j)}, v'_{L(j)}), (v'_{L(j)}, r(C)), (r(C), v_{L(j)})\}$.

- A literal path $lpath(L)$: $(u_i, v_{L(1)}), (v_{L(1)}, v_{L(2)}), (v_{L(2)}, v_{L(3)}), \ldots, (v_{L(o_L)}, w_i)$,
for each literal $L \in X_i$.

- For $i = 1, \ldots, N$, back edges $(w_i, u_i)$; for $i = 1, \ldots, N - 1$, forward edges
$(w_i, u_{i+1})$.

- A finishing path $(w_N, r(C_1)), (r(C_1), r(C_2)), (r(C_2), r(C_3)), \ldots, (r(C_M), u_1)$.

Figure 2 shows an example of this constructed graph. The motivation for this construction
is the following: Using the position sets, we intend to force the literal paths
of the different variables to be traversed in the natural order, where the only degree of
freedom is switching order between $lpath(x_i)$ and $lpath(\bar{x}_i)$. This switch will correspond
to a truth assignment for variable $x_i$, by assigning True to the literal in $X_i$ whose
lpath was visited first. After visiting a special vertex along this first path, we either
visit its clause triangle, or its bypass triangle. Along the other path (the one of the literal
assigned False) only a bypass triangle can be visited.

Eventually, the finishing path is traversed. Its vertices are visited in the natural order.
Upon visiting a vertex $r(C)$, we visit only one bypass triangle - the yet unvisited triangle
among those corresponding to the literals of clause C.

We now describe the sets P{e). We will use the following notation: 

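$b_i = o_{x_i} + o_{\bar{x}_i}$ for $i = 1, \ldots, N$.

$B_i = \sum_{j=1}^{i} b_j$ for $i = 0, \ldots, N$ ($B_0 = 0$).

$\mathit{Base}_L = \mathit{Base}_{\bar{L}} = \mathit{Base}_i = 4B_{i-1} + 4(i - 1)$ for $L \in X_i$.

$\mathit{Alternate}_L = \mathit{Base}_i + 4o_{\bar{L}} + 2$ for $L \in X_i$.

$\mathit{ClauseBase}_c = \mathit{Base}_{N+1} + 4c$.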
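The position bookkeeping can be made concrete with a small Python sketch. It rests on an assumed reading of the partly illegible definitions above: the base of variable i is taken to be four times the number of literal occurrences of earlier variables plus 4(i - 1), which is the value consistent with each variable receiving a window of 4b_i + 3 positions, and the alternate position of a literal is offset by the occurrence count of its negation. The function names and the (positive, negated) input layout are illustrative.

```python
def position_bases(occurrences):
    """occurrences: list of (o_pos, o_neg) pairs, one per variable x_1..x_N,
    giving the number of positive and negated occurrences.
    Returns (base, alternate): base[i] for i = 1..N+1 (index 0 unused) and
    alternate[(i, is_positive)] for each literal of variable i."""
    n_vars = len(occurrences)
    b = [p + q for p, q in occurrences]
    prefix = [0]  # prefix[i] = b_1 + ... + b_i
    for b_i in b:
        prefix.append(prefix[-1] + b_i)
    # Assumed reconstruction: Base_i = 4 * B_{i-1} + 4 * (i - 1).
    base = [None] + [4 * prefix[i - 1] + 4 * (i - 1) for i in range(1, n_vars + 2)]
    alternate = {}
    for i, (p, q) in enumerate(occurrences, start=1):
        # The alternate slot of a literal starts after the path of its negation.
        alternate[(i, True)] = base[i] + 4 * q + 2
        alternate[(i, False)] = base[i] + 4 * p + 2
    return base, alternate


def clause_base(base, c):
    """ClauseBase_c = Base_{N+1} + 4c."""
    return base[-1] + 4 * c


base, alternate = position_bases([(3, 2), (1, 1)])
print(base[1:], alternate[(1, True)], clause_base(base, 1))
```

With three positive and two negated occurrences of the first variable, its window spans positions 1 to 23, that is, 4b_1 + 3 positions, before the forward edge at position 24.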




Fig. 2. An example of the construction for the formula {x\ V X2 V X3) A (xi V X2V X3). 
All large grey (black) vertices are actually the same vertex $r(C_1)$ ($r(C_2)$).



- For each forward edge $e = (w_{i-1}, u_i)$, $2 \le i \le N$, we set $P(e) = \{\mathit{Base}_i\}$.
This is intended to ensure that the literal paths are traversed in a constrained order:
$lpath(x_i)$ and $lpath(\bar{x}_i)$ are allocated a time interval $[\mathit{Base}_i + 1, \mathit{Base}_{i+1} - 1]$ of
length $4b_i + 3$, during which they must be traversed.

- For each back edge $e = (w_i, u_i)$ we set $P(e) = \{\mathit{Alternate}_{x_i}, \mathit{Alternate}_{\bar{x}_i}\}$. This
enables either visiting $lpath(x_i)$ first, then e and $lpath(\bar{x}_i)$, or visiting $lpath(\bar{x}_i)$
first, followed by e and $lpath(x_i)$.

- For each literal path edge $e = (v_{L(j)}, v_{L(j+1)})$ with $L \in X_i$, $0 \le j \le o_L$, we set
$P(e) = \{\mathit{Base}_i + 4j + 1, \mathit{Alternate}_L + 4j + 1\}$. Consecutive edges in a literal
path are thus positioned 4 time units apart (allowing a triangle in-between).

- For each clause $C = L_1(j_1) \vee L_2(j_2) \vee L_3(j_3)$ with the clause triangle 
$\{e_1 = (v_{L_1(j_1)}, v_{L_2(j_2)}),\ e_2 = (v_{L_2(j_2)}, v_{L_3(j_3)}),\ e_3 = (v_{L_3(j_3)}, v_{L_1(j_1)})\}$ such that $L_k \in X_{i_k}$, 
define $t_k = Base_{i_k} + 4j_k - 2$ and set 

$$P(e_1) = \{t_1,\ t_3 + 1,\ t_2 + 2\},$$

$$P(e_2) = \{t_2,\ t_1 + 1,\ t_3 + 2\},$$

$$P(e_3) = \{t_3,\ t_2 + 1,\ t_1 + 2\}.$$

This means that the edges of a clause triangle must be visited consecutively during 
the traversal of $lpath(L_k)$, for some $k$. Furthermore, note that this may happen only 
if $lpath(L_k)$ is traversed immediately after time $Base_{L_k}$, that is, only if it precedes 
$lpath(\bar{L}_k)$ (see Figure 3). 
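The effect of these three position sets can be checked with a small script. This is only an illustrative sketch: the function names are hypothetical, and the sets are taken to be $P(e_1) = \{t_1, t_3+1, t_2+2\}$, $P(e_2) = \{t_2, t_1+1, t_3+2\}$, $P(e_3) = \{t_3, t_2+1, t_1+2\}$. It enumerates the start times at which the triangle can be walked in three consecutive steps.

```python
def clause_triangle_positions(t1, t2, t3):
    """Allowed positions for the three edges of a clause triangle."""
    return {
        "e1": {t1, t3 + 1, t2 + 2},
        "e2": {t2, t1 + 1, t3 + 2},
        "e3": {t3, t2 + 1, t1 + 2},
    }


def consecutive_start_times(positions):
    """Start times s such that some rotation of (e1, e2, e3) fits s, s+1, s+2."""
    order = ["e1", "e2", "e3"]
    starts = set()
    for r in range(3):
        first, second, third = (order[(r + i) % 3] for i in range(3))
        for s in positions[first]:
            if s + 1 in positions[second] and s + 2 in positions[third]:
                starts.add(s)
    return starts
```

For well-separated values such as $t_1 = 10$, $t_2 = 30$, $t_3 = 50$, the only feasible start times are $t_1$, $t_2$ and $t_3$ themselves, matching the claim that the triangle is entered from one of its three literal vertices.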




Fig. 3. A clause triangle, with vertex denoted by k. The allowed positions for 

each edge appear in brackets. 



- For each finishing edge $e = (r(C_c), r(C_{c+1}))$, $0 \le c \le M$, we set $P(e) = \{ClauseBase_c\}$, 
thus determining the order in which the vertices of the finishing 
path are visited, allowing a time slot $[ClauseBase_c + 1, ClauseBase_{c+1} - 1]$ of 
length 3, for the bypass triangle visited while traversing $r(C_c)$. 

- For a bypass triangle with the edges {e = e' = r(Cc)), 

e" = (r(Cc),'UL(j))}, we set: 

$$P(e) = \{Base_L + 4j - 2,\ Alternate_L + 4j - 2,\ ClauseBase_c + 2\},$$
$$P(e') = \{Base_L + 4j - 1,\ Alternate_L + 4j - 1,\ ClauseBase_c + 3\},$$
$$P(e'') = \{Base_L + 4j,\ Alternate_L + 4j,\ ClauseBase_c + 1\}.$$

This means that the bypass triangle edges must be visited consecutively, and there 
are three possible time slots for that: 
• While traversing $lpath(L)$, before traversing $lpath(\bar{L})$. 

• While traversing $lpath(L)$, after traversing $lpath(\bar{L})$. 

• While traversing $r(C_c)$ along the finishing path. 

The reduction is obviously polynomial. We now prove validity of the construction. 

Suppose that $F$ is satisfiable. We will show that $(D, P)$ is a "yes" instance of the 
3-positional Eulerian path problem. Let $\phi$ be a truth assignment satisfying $F$. For 
each clause $C_c$, let $L_c(j_c)$ be a specific literal occurrence satisfying $C_c$. 
We describe an Eulerian path $\pi$ in $D$. Set $\pi(ClauseBase_c) = (r(C_c), r(C_{c+1}))$, 
for $c = 0, \ldots, M$. Set $\pi(Base_i) = (u_{i-1}, u_i)$, for $i = 2, \ldots, N$. For all $i$, if 
$\phi(x_i) = $ True, set $\pi(Alternate_{\bar{x}_i}) = (u_i, u_i)$. Otherwise, set $\pi(Alternate_{x_i}) = (u_i, u_i)$. 

For each literal $L \in X_i$: 

• If $\phi(L) = $ True: For each $0 \le j \le a_L$, set $\pi(Base_i + 4j + 1) = (v_{L(j)}, v_{L(j+1)})$ 
(see Figure 4, top). 




Fig. 4. Either a clause triangle or a bypass triangle must be traversed upon visiting a 
special vertex due to time constraints. Edge positions in case L is assigned True 
(False) are shown at the top (bottom). 



We further distinguish between two cases: 

* If $L(j) = L_c(j_c)$ for the clause $C_c = L(j) \vee L'(j') \vee L''(j'')$ in which 
$L(j)$ occurs, then set $\pi$ to visit the edges of the clause triangle of $C_c$ as 
follows: 

$$\pi(Base_L + 4j - 2) = (v_{L(j)}, v_{L'(j')}),$$

$$\pi(Base_L + 4j - 1) = (v_{L'(j')}, v_{L''(j'')}),$$

$$\pi(Base_L + 4j) = (v_{L''(j'')}, v_{L(j)}).$$

Furthermore, in this case we set tt to visit the edges of the bypass triangle 
of L{j) as follows: 

Tr{ClauseBasec + 1) = {r{Cc),VL(j)) , 
Tr{ClauseBasec + 2) = {'VL(j)jVL(j)) j 
TT{ClauseBasec + 3) = {vL{j),i"{Cc)) ■ 
* Otherwise, $L(j) \ne L_c(j_c)$ for the clause $C_c$ in which $L(j)$ occurs. In this 
case we set $\pi$ to visit the edges of the bypass triangle of $L(j)$ as follows: 

Ti{BaseL +4j - 2) = {vL(j),VL(i)) , 

TT{BaseL +4j - 1) = (1)^^), r(Cc)) , 

Tr{BaseL + 4j) = {r{Cc),v^j)) . 

• If $\phi(L) = $ False: For each $0 \le j \le a_L$, set $\pi(Alternate_L + 4j + 1) = (v_{L(j)}, v_{L(j+1)})$. 
Furthermore, in this case we set $\pi$ to visit the edges of the 
bypass triangle of $L(j)$ as follows (see Figure 4, bottom): 

Tr{AlternateL + - 2) = (vL(j),VL(j)) , 

■K{AlternateL + 4j - 1) = (ul(j), r(C'c)) , 
n{AlternateL + 4j) = (r(Cc), . 

Examining all the cases shows that $\pi$ is a permutation of the edges, and if $\pi(k) = (u, v)$, 
$\pi(k + 1) = (u', v')$ then $v = u'$. Hence, $\pi$ is an Eulerian path. Furthermore, 
by our construction $\pi$ is compliant with $(D, P)$, proving that $(D, P)$ is a "yes" 
instance. 

Let $\pi$ be an Eulerian path compliant with $(D, P)$. We shall construct an assignment 
$\phi$ satisfying $F$. In order to determine $\phi(x_i)$ we consider the edge $\pi(Base_i + 1)$. 
By construction, $\pi(Base_i + 1) = (v_{L(0)}, v_{L(1)})$ for some $L \in X_i$. We therefore set 
$\phi(L) = $ True (and of course $\phi(\bar{L}) = $ False). We observe that for any other edge 
$e' = (v_{L(j)}, v_{L(j+1)})$ along $lpath(L)$, we must have $\pi(Base_i + 4j + 1) = e'$ iff 
$\phi(L) = $ True. 

We now prove that $\phi$ satisfies each clause $C_c = L_1(j_1) \vee L_2(j_2) \vee L_3(j_3)$ in $F$. 
Consider the clause triangle of $C_c$: $\{e_1 = (v_{L_1(j_1)}, v_{L_2(j_2)}),\ e_2 = (v_{L_2(j_2)}, v_{L_3(j_3)}),\ e_3 = (v_{L_3(j_3)}, v_{L_1(j_1)})\}$. 
Denote $t_k = Base_{L_k} + 4j_k - 2$. Suppose that $\pi^{-1}(e_k) \ne t_k$ 
for some $1 \le k \le 3$; then by the positional constraints the edge visited prior 
to $e_k$ must be in the clause triangle. Therefore, there exists some $1 \le k \le 3$ 
for which $\pi(t_k) = e_k$. Furthermore, the edge $e$ preceding $e_k$ in $\pi$ must have 
$t_k - 1 = Base_{L_k} + 4(j_k - 1) + 1 \in P(e)$. The only such edge entering $v_{L_k(j_k)}$ is 
the literal path edge $(v_{L_k(j_k - 1)}, v_{L_k(j_k)})$. Therefore, $\phi(L_k) = $ True, satisfying $C_c$. 

This proves that $F$ is satisfiable iff $(D, P)$ is a "yes" instance, completing the proof 
of Theorem 1. 

Observe that the graph constructed in the proof of Theorem 1 has in-degree and 
out-degree bounded by 4, giving rise to the following result: 

Corollary 2. 3-PEP is NP-complete, even when restricted to graphs with in-degree and 
out-degree bounded by 4. 

Henceforth, we call this restricted problem (3,4)-PEP. We comment that a slight 
modification of the construction results in a graph whose in-degree and out-degree are 
bounded by 2. 
5 3-Positional SBH Is NP-Complete 

We show in this section that the problem of sequencing by hybridization with at most 
3 positions per spectrum element is NP-complete, even if each element in the spectrum 
is unique. The proof is by reduction from (3,4)-PEP. 

Theorem 3. The 3-PSBH problem is NP-complete, even if all spectrum elements are of 
multiplicity one. 

Proof. It is easy to see that the problem is in NP. We reduce (3,4)-PEP to 3-PSBH. Let 
$(D = (V, E), P)$ be an instance of (3,4)-PEP. Let $k = \lceil \log_4 |V| \rceil + 2$, $p = 3k + 1$ and 
$c = p + 1$. In order to construct an instance of 3-PSBH we first encode the edges and 
vertices of $D$. In the following, we denote string concatenation by $|$. We let $\sigma_1 = $ 'A', 
$\sigma_2 = $ 'C', $\sigma_3 = $ 'G' and $\sigma_4 = $ 'T'. 

For each $v \in V$ we assign a unique string in $\Sigma^{k-2}$. We add a leading 'T' symbol 
and a trailing 'T' symbol to this string, and call the resulting $k$-long sequence the name 
of $v$. We also assign the (unique) sequence 'A...A' of length $k$ to encode a space. Each 
vertex is encoded by a $3k$-long sequence containing two copies of its name separated by 
a space. We denote the encoding of $v$ by $en(v)$. Each edge $(u, v) \in E$ is encoded by two 
symbols chosen as follows: Let $N_{out}(u) = \{v_1, \ldots, v_l\}$, where $v = v_i$ for some $i$, and 
$l \le 4$. Let $N_{in}(v) = \{u_1, \ldots, u_r\}$, where $u = u_j$ for some $j$, and $r \le 4$. Then $(u, v)$ 
is encoded by $\sigma_i | \sigma_j$, and we denote its encoding by $en(u, v)$. We let $EN((u, v)) = 
en(u) | en(u, v) | en(v)$ (see Figure 5). 
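The encoding can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' implementation: names are drawn from $\Sigma^{k-2}$ in lexicographic order, neighbours are indexed in sorted order, and the helper names (`make_names`, `en_vertex`, `en_edge`, `EN`) are hypothetical.

```python
import itertools

SIGMA = "ACGT"  # sigma_1..sigma_4, stored 0-based


def make_names(vertices):
    """Give every vertex a k-long name 'T' + unique string + 'T'."""
    m = 0
    while 4 ** m < len(vertices):  # smallest m with 4^m >= |V|
        m += 1
    k = m + 2
    cores = itertools.product(SIGMA, repeat=m)
    names = {v: "T" + "".join(core) + "T" for v, core in zip(vertices, cores)}
    return names, k


def en_vertex(v, names, k):
    """en(v): two copies of the name separated by a k-long space of 'A's."""
    return names[v] + "A" * k + names[v]


def en_edge(u, v, edges):
    """en(u, v): index of v among out-neighbours of u, then of u among in-neighbours of v."""
    outs = sorted({y for x, y in edges if x == u})
    ins = sorted({x for x, y in edges if y == v})
    return SIGMA[outs.index(v)] + SIGMA[ins.index(u)]


def EN(u, v, edges, names, k):
    """EN((u, v)) = en(u) | en(u, v) | en(v), a string of length 2p = 6k + 2."""
    return en_vertex(u, names, k) + en_edge(u, v, edges) + en_vertex(v, names, k)
```

The sketch relies on in-degree and out-degree being at most 4, so that every neighbour index fits into a single symbol.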



Fig. 5. The encoding of vertices and edges into strings. (Recoverable labels: the name of $v$ 
is 'T', a unique sequence, 'T'; $en(v)$ is the name of $v$, a space 'A···A', and the name of $v$ again; 
an edge is encoded as $en(u)$, $\sigma_i \sigma_j$, $en(v)$.) 



We now construct a 3-PSBH instance, i.e., a set $S$ with position constraints $T$, as 
follows: For every edge $(u, v) \in E$ the set $S$ contains all $p$-long substrings of the 
$2p$-long sequence $EN((u, v))$ ($c$ substrings in total), indexed by their starting offset 
$i = 0, \ldots, p$. Let $P((u, v)) = \{t_1, \ldots, t_l\}$, $1 \le l \le 3$, be the set of allowed positions 
for $(u, v)$. Then we set the allowed positions of the $i$-th substring to 
$\{c(t_1 - 1) + i, \ldots, c(t_l - 1) + i\}$ for all $i$ (substring positions are 0-up). 
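As a concrete illustration, the spectrum and its position constraints can be generated as below. The function name `positional_spectrum` and its input format (a list of pairs of an encoded edge string and its allowed position set) are assumptions made for this sketch.

```python
def positional_spectrum(encoded_edges, p):
    """Map every p-long substring of each 2p-long edge encoding to its allowed positions.

    encoded_edges: iterable of (EN_string, allowed), where allowed is the set
    P((u, v)) of 1-based positions of the edge in the Eulerian path.
    """
    c = p + 1  # number of p-long substrings in a 2p-long string
    spectrum = {}
    for en_string, allowed in encoded_edges:
        if len(en_string) != 2 * p:
            raise ValueError("edge encoding must have length 2p")
        for i in range(c):
            substring = en_string[i:i + p]
            # The edge at path position t starts at offset c * (t - 1).
            spectrum[substring] = {c * (t - 1) + i for t in allowed}
    return spectrum
```

For an edge allowed at positions 1 and 3 with $p = 10$, the first substring may start at 0 or 22 and the last at 10 or 32.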






Lemma 4. Each of the p-long substrings defined above is unique. 

Proof. Suppose that $s$ is both the $i$-th substring of $EN((u, v))$ and the $j$-th substring 
of $EN((w, z))$. We first claim that $i = j$. There are two cases 
to examine: If $s$ contains a space at positions $r, \ldots, r + k - 1$, $0 \le r \le 2k + 1$, then 
$i = j = (c + k - r) \pmod{c}$. Otherwise, $s$ begins with a run of 'A' symbols of length 
$0 \le r' \le k - 1$. This run belongs to a space in $en(u)$ and $en(w)$, and must be followed 
by the symbol 'T'. In this case $i = j = 2k - r'$. 

By construction, $s$ contains a name of a vertex plus a unique symbol identifying an 
edge entering or leaving that vertex, implying that $(u, v) = (w, z)$. ∎ 

We now show the validity of the reduction. 

⇐ Suppose that $\pi = (v_0, v_1), (v_1, v_2), \ldots, (v_{m-1}, v_m)$ is a solution of the (3,4)-PEP 
instance. We claim that $X = en(v_0) | en((v_0, v_1)) | en(v_1) | en((v_1, v_2)) | \cdots | en(v_m)$ 
is a solution of the 3-PSBH instance. By Lemma 4 each $p$-long substring of $X$ 
occurs exactly once in $X$. As $\pi$ visits all edges in $D$, we have that $S$ is the 
$p$-spectrum of $X$. The fact that position constraints are obeyed follows directly from 
the construction. 

⇒ Let $X$ be a solution of the 3-PSBH instance. Consider the $m$ substrings of length $p$ 
whose starting positions are integer multiples of $c$. By the position constraints, the 
$r$-th such substring is an encoding of some vertex $v_r$, followed by a symbol $\sigma_{i_r}$. 
Denote by $w_r$ the $i_r$-th out-neighbor of $v_r$. We prove that $\pi = (v_1, w_1), \ldots, (v_m, w_m)$ 
is an Eulerian path compliant with $(D, P)$. 

Since each string in the $p$-spectrum of $X$ is unique, $\pi$ is a permutation of the edges 
in $D$. To prove that $\pi$ is a path in $D$ we have to show that $w_r = v_{r+1}$ for 
$r = 1, \ldots, m - 1$. Let $x$ be the $p$-long substring of $X$ starting at position $rc + 2k$. 
We observe that $x$ must begin with the last $k$ symbols of $en(v_r)$, which compose 
$name(v_r)$, followed by $\sigma_{i_r}$, some symbol, and the first $2k - 1$ symbols of $en(v_{r+1})$, 
which contain $name(v_{r+1})$. The uniqueness of $name(v_r)$, $name(v_{r+1})$ and the 
index $i_r$ among the out-neighbors of $v_r$ implies that $w_r = v_{r+1}$. The claim now 
follows, since position constraints are trivially satisfied by $\pi$. ∎ 



Acknowledgments 

The first author was supported by the program for mathematics and molecular biology. 
The second author was supported by the Clore foundation scholarship. The third author 
was supported in part by a grant from the Ministry of Science, Israel. The fourth author 
was supported by Eshkol scholarship from the Ministry of Science, Israel. 



References 

[1] L. M. Adleman. Location sensitive sequencing of DNA. Technical report. University of 
Southern California, 1998. 

[2] R. Drmanac and R. Crkvenjakov, 1987. Yugoslav Patent Application 570. 

[3] B. Aspvall, M. F. Plass, and R. E. Tarjan. A linear-time algorithm for testing the truth of 
certain quantified boolean formulas. Information Processing Letters, 8(3):121-123, 1979. 






[4] W. Bains and G. C. Smith. A novel method for nucleic acid sequence determination. J. 
Theor. Biology, 135:303-307, 1988. 

[5] S. D. Broude, T. Sano, C. S. Smith, and C. R. Cantor. Enhanced DNA sequencing by 
hybridization. Proc. Nat. Acad. Sci. USA, 91:3072-3076, 1994. 

[6] S. Hannenhalli, P. Pevzner, H. Lewis, and S. Skiena. Positional sequencing by hybridization. 
Computer Applications in the Biosciences, 12:19-24, 1996. 

[7] K. R. Khrapko, Yu. P. Lysov, A. A. Khorlyn, V. V. Shick, V. L. Florentiev, and A. D. 
Mirzabekov. An oligonucleotide hybridization approach to DNA sequencing. FEBS letters, 
256:118-122, 1989. 

[8] Y. Lysov, V. Florentiev, A. Khorlyn, K. Khrapko, V. Shick, and A. Mirzabekov. DNA 
sequencing by hybridization with oligonucleotides. Dokl. Acad. Sci. USSR, 303:1508-1511, 
1988. 

[9] S. C. Macevicz, 1989. International Patent Application PS US89 04741. 

[10] P. A. Pevzner. l-tuple DNA sequencing: computer analysis. J. Biomol. Struct. Dyn., 
7:63-73, 1989. 

[11] P. A. Pevzner and R. J. Lipshutz. Towards DNA sequencing chips. In Symposium on 
Mathematical Foundations of Computer Science, pages 143-158. Springer, 1994. LNCS 
vol. 841. 

[12] P. A. Pevzner, Yu. P. Lysov, K. R. Khrapko, A. V. Belyavsky, V. L. Florentiev, and A. D. 
Mirzabekov. Improved chips for sequencing by hybridization. J. Biomol. Struct. Dyn., 
9:399-410, 1991. 

[13] F. Preparata, A. Frieze, and E. Upfal. On the power of universal bases in sequencing by 
hybridization. In Proceedings of the Third Annual International Conference on Computational 
Molecular Biology (RECOMB '99), pages 295-301, 1999. 

[14] S. S. Skiena and G. Sundaram. Reconstructing strings from substrings. J. Comput. Biol., 
2:333-353, 1995. 

[15] E. Southern, 1988. UK Patent Application GB8810400. 

[16] E. M. Southern. DNA chips: analysing sequence by hybridization to oligonucleotides on a 
large scale. Trends in Genetics, 12:110-115, 1996. 

[17] E. M. Southern, U. Maskos, and J. K. Elder. Analyzing and comparing nucleic acid 
sequences by hybridization to arrays of oligonucleotides: evaluation using experimental 
models. Genomics, 13:1008-1017, 1992. 




GESTALT: Genomic Steiner Alignments 



Giuseppe Lancia*¹ and R. Ravi**² 

¹ Dipartimento Elettronica ed Informatica, University of Padova, 
lancia@dei.unipd.it 
² GSIA, Carnegie Mellon University, 
ravi@cmu.edu 



Abstract. We describe GESTALT (GEnomic sequences STeiner ALignmenT), 
a public-domain suite of programs for generating multiple alignments 
of a set of biosequences. We allow the use of either of the two popular 
objectives, Tree Alignment or Sum-of-Pairs. The main distinguishing 
feature of our method is that the alignment is obtained via a tree in 
which the internal nodes (ancestors) are labeled by Steiner sequences 
for triples of the input sequences. Given lists of candidate labels for the 
ancestral sequences, we use dynamic programming to choose an optimal 
labeling under either objective function. Finally, the fully labeled tree of 
sequences is turned into a multiple alignment. Enhancements in our 
implementation include the traditional space-saving ideas of Hirschberg 
as well as new data-packing techniques. The running-time bottleneck of 
computing exact Steiner sequences is handled by a highly effective but 
much faster heuristic alternative. Finally, other modules in the suite 
allow automatic generation of linear-program input files that can be used 
to compute new lower bounds on the optimal values. We also report on 
some preliminary computational experiments with GESTALT. 



1 Introduction 

Comparing genomic sequences drawn from individuals of the same or different 
species is one of the fundamental problems in computational molecular biology. 
These comparisons can (i) lead to the identification of highly conserved (and 
therefore presumably functionally relevant) genomic regions, (ii) spot fatal mu- 
tations, (iii) suggest evolutionary relationships, (iv) help in correcting sequencing 
errors etc. Therefore, the mathematical formulation and solution of the Multiple 
Sequence Alignment problem has been and remains a fundamental challenge for 
computational molecular biologists. 

Aligning a set of sequences consists in arranging them in a matrix having 
each sequence in a row. This is obtained by possibly inserting spaces (gaps) in 
each sequence so that they all have the same length. The following is a simple 
example of an alignment of the sequences ATTCGAC, TTCCGTC and ATCGTC. 

* Most of this work was done when this author was visiting CMU during Summer '98, 
under a grant from the CMU Faculty Development Fund. 

** Supported in part by an NSF CAREER grant CCR-9625297 



M. Crochemore, M. Paterson (Eds.): CPM’99, LNCS 1645, pp. lOl Hiidl 1999. 
(c) Springer- Verlag Berlin Heidelberg 1999 
ATT-CGA-C 

-TTCCG-TC 

A-T-CG-TC 

There are many popular formulations of the alignment problem. The choice 
of the objective function for multiple alignments depends mainly on the presence 
or absence of extra input information in the form of a phylogenetic tree relating 
the sequences to their unknown ancestors. In fact, when such a tree is given, 
knowledge of the ancestral sequences would imply the possibility of aligning the 
given sequences by progressively aligning each sequence to its ancestor in the 
tree all the way to the root and chaining these pairwise alignments together |Sj . 
Hence when a phylogeny is given, the tree alignment ( TA ) objective consists in 
finding the best ancestral sequences to label this tree and deriving the induced 
alignment. Guided by parsimony, the best labeling is taken to be one minimizing 
the total evolutionary change represented in the tree, namely, the total distance 
of all the edges in the tree. When the phylogenetic tree is not available, a popular 
multiple alignment objective is the Sum-of-pairs (SP) objective, which attempts 
to minimize the average distance between a pair of sequences in the multiple 
alignment. This objective results naturally by extending the alignment objective 
for pairs of sequences, namely, that of minimizing the edit-distance between the 
pair, to more than two sequences. The SP objective has been popular in the 
literature and several heuristic implementations addressing it proceed by first 
finding a heuristic tree spanning the sequences and aligning them progressively 
as mentioned earlier to obtain the final alignment. 

Historically, the SP objective is the one to which more attention has been 
devoted by computational biologists, and correspondingly a set of programs have 
been developed which are now widely in use. Among them, the only program 
that computes optimal SP alignments is MSA by Lipman, Altschul, Kececioglu, 
Gupta and Schaeffer iasi. A variety of other multiple sequence alignment pro- 
grams implicitly use the SP objective in guiding heuristic construction of the 
multi-alignments: An example is GLUSTAL V (see also the various meth- 
ods described in the surveys m for other examples). As for tree alignment, 
the only implementation that addresses this problem directly that we are aware 
of is the recent TAAR by Jiang and Liu \ I .'11 . This program implements some 
of the ideas from the approximation algorithms of Jiang, Lawler and Wang 
to heuristically compute tree alignments, phylogenies and generalized tree 
alignments. 

In this paper we introduce and describe a new public-domain suite of pro- 
grams for multiple sequence alignment that produce heuristic alignments under 
both the TA and SP alignment objectives. Like TAAR, our methods are based 
on ideas used in an approximation algorithm for tree alignment due to Ravi and 
Kececioglu HZ!. However, unlike the methods of Jiang, Lawler and Wang |2Z! 
on which TAAR is based, whose refined heuristics require very high running 
times, the ideas of Ravi and Kececioglu are based on mainly computing and us- 
ing Steiner sequences as candidates for the unlabeled ancestral sequences in the 
tree. Intuitively, a Steiner sequence for a given set of sequences is a “central” se- 
quence to them, one whose sum of distances to all these sequences is minimized. 
Once these Steiner sequences for appropriate subsets of the input sequences have 
been computed, dynamic programming can be used to efficiently pick one such 
sequence for each ancestral node so as to minimize the total resulting distance in 
the tree, as in m- Thus, this method is adaptable for efficient implementation 
giving us the freedom to specify the subsets of sequences for which the Steiner 
sequences must be computed. Further, we can effectively adapt this general idea 
by modifying the dynamic program to provide an efficient heuristic even for the 
SP objective using the postulated Steiner ancestors. 

Further refinements in our implementation include incorporating the tradi- 
tional space-saving ideas of Hirschberg as well as some new data-packing 
techniques to reduce the space overhead. The running-time bottleneck in our 
method of computing exact Steiner sequences is effectively handled by a much 
faster heuristic alternative that has never shown more than two percent degra- 
dation in quality in our extensive preliminary testing. Finally, other programs in 
the suite allow automatic generation of linear-programming models as files that 
can be input to the popular commercial CPLEX package. The solutions of these 
programs give lower bounds on the minimum TA and SP alignment values for 
the given set of sequences, thus providing the deviations from optimality on a 
case-by-case basis. 

We formally describe the various objectives and methods in the remainder 
of this section. In Sect. 2 we give a high-level description of the algorithms in 
GESTALT, together with an analysis of the individual steps. In Sect. 3 we report 
on some experimental results on real data. 



1.1 Edit Distance 

At the heart of any alignment algorithm lies the procedure for optimally 
comparing two given sequences. This problem is called pairwise alignment, and is 
formulated as follows. Given symmetric costs $c(a, b)$ for replacing a symbol $a$ with 
a symbol $b$ and costs $c(a, -)$ for deleting (inserting) symbol $a$, find a 
minimum-cost set of symbol operations that turn a sequence $S'$ into a sequence $S''$. It is 
well known that this problem can be solved by dynamic programming in time 
and space $O(l^2)$, where $l$ is the length of the sequences. The value of an optimal 
solution is called the edit distance of $S'$ and $S''$ and denoted by $d(S', S'')$. 
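The quadratic dynamic program can be sketched as follows. This is the textbook recurrence, not code taken from GESTALT; the default unit costs and the parameter names `sub_cost` and `gap_cost` are assumptions made for illustration.

```python
def edit_distance(s, t, sub_cost=lambda a, b: 0 if a == b else 1,
                  gap_cost=lambda a: 1):
    """Minimum total cost of substitutions and gaps turning s into t.

    Uses O(len(s) * len(t)) time and space.
    """
    rows, cols = len(s) + 1, len(t) + 1
    d = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against the empty string costs one gap per symbol.
    for i in range(1, rows):
        d[i][0] = d[i - 1][0] + gap_cost(s[i - 1])
    for j in range(1, cols):
        d[0][j] = d[0][j - 1] + gap_cost(t[j - 1])
    for i in range(1, rows):
        for j in range(1, cols):
            d[i][j] = min(
                d[i - 1][j - 1] + sub_cost(s[i - 1], t[j - 1]),  # match or replace
                d[i - 1][j] + gap_cost(s[i - 1]),                # delete from s
                d[i][j - 1] + gap_cost(t[j - 1]),                # insert into s
            )
    return d[-1][-1]
```

Only the value is returned here; recovering the alignment itself would require a traceback through the table `d`.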

An alignment $A$ of two (or more) sequences is a way of inserting '-' characters 
(gaps) in the sequences so that the resulting sequences have the same 
length. For two sequences $S'$ and $S''$, the value $d_A(S', S'')$ of their alignment 
is obtained by adding up the costs for the pairs of characters in corresponding 
positions. It is immediate that $d(S', S'') = \min_A d_A(S', S'')$. 



1.2 The Sum-of-Pairs Alignment Problem 

The SP score is the generalization to many sequences of the pairwise alignment 
objective, in which the cost of the alignment is obtained by adding the costs 
of the symbols matched up at the same positions. Analogously, in a multiple 
alignment the cost is obtained by adding up the costs of the matched characters, over all 
the positions and for all the pairs of sequences. 
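A short sketch of the SP score of a given multiple alignment follows. It assumes unit costs (mismatch and gap against a symbol cost 1, and a column pairing two gaps costs 0); the function name `sp_score` is hypothetical.

```python
from itertools import combinations


def sp_score(rows, cost=lambda a, b: 0 if a == b else 1):
    """Sum, over all pairs of rows and all columns, of the cost of the paired symbols.

    rows: equal-length strings over the alphabet plus '-' for gaps.
    With the default cost, two gaps in the same column contribute 0.
    """
    if len({len(r) for r in rows}) > 1:
        raise ValueError("all rows of an alignment must have the same length")
    return sum(
        cost(a, b)
        for r1, r2 in combinations(rows, 2)
        for a, b in zip(r1, r2)
    )
```

Applied to the three-row example alignment of ATTCGAC, TTCCGTC and ATCGTC from the introduction, the pairwise values are 4, 3 and 3, for an SP score of 10 under these unit costs.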

Minimizing SP is NP-hard m- In [3 Gusfield showed that a tree-based 
progressive alignment method due to Feng and Doolittle (described below) using 
the minimum cost star gives a 2-approximation. In the program described in this 
paper we push this idea further, by considering also trees that are not only stars 
and also employing alignments with sequences which are not in the original set, 
but are derived from it as Steiner sequences of some of the original ones. 



1.3 The Tree Alignment Problem 

In the tree alignment problem, we are given n sequences related by an evolu- 
tionary tree T. The sequences label the leaves of the tree, while the internal 
nodes correspond to the unknown ancestral sequences from which the others 
have evolved. The problem consists in finding the sequences at the internal nodes 
which minimize the cost of the tree, defined as $\sum_{(S_i, S_j) \in T} d(S_i, S_j)$. When $T$ is 
a star, the problem is called a Steiner problem, and the optimal sequence for the 
center is called the Steiner sequence for the leaves. 

The first exact algorithm for tree alignment was proposed by Sankoff in m, 
and is based on dynamic programming. Later Altschul and Lipman fP introduced 
some bounding rules to reduce the size of the dynamic programming lattice. 
Due to the prohibitive worst case complexity of exact methods, approximation 
algorithms for this problem were devised, by Jiang, Lawler and Wang m first, 
and improved by Wang and Gusfield I2ni later. In PH a 2-approximation method 
is described, based on what are called lifted alignments. In lifted alignments, the 
internal nodes can only be labeled by sequences occurring at the leaves. The 
running time of their algorithm is 0{n^P + n^) for a tree of n leaves of length 1. 
For trees of bounded degree d, they also provided the first PTAS for the problem. 
For any t, their approximation scheme guarantees a solution within a factor 1 -I- f 
of optimal, in time ^-i/d-iy 

For regular d-ary trees on n sequences, Ravi and Kececioglu gave in m a 
^^-approximation algorithm with running time roughly (0{2kn)‘^) - the main 
ideas of their algorithm are briefly described in Sect.0 The program GESTALT 
described in this paper is the first implementation of the ideas in C3- 



1.4 A Tree-Based Progressive Alignment Method 

A reasonable requirement on the cost function is that $c(a, a) = 0$ for all $a$, and that it 
obeys the triangle inequality. In this case, the edit distance induces a metric over 
the space of all sequences and, given n sequences, we can talk of graphs having 
the sequences as vertices and for which an edge is weighted by the edit distance 
between the endpoints. In this setting, graph theoretical concepts such as span- 
ning trees, stars and Steiner points, have been widely used in the design and 
analysis of effective alignment algorithms. In particular, a folklore approach to 
multiple alignments is due to Feng and Doolittle and shows how we can use 
any tree to align a set of $n$ sequences. The appeal of the approach is that for 
$n - 1$ out of $n(n - 1)/2$ pairs, the pairwise alignment induced is in fact optimal. 

Proposition 1. For any tree $T$ over a set of sequences, there exists a multiple 
alignment $A(T)$ of the sequences such that $d_{A(T)}(S', S'') = d(S', S'')$ for all the 
pairs of sequences $(S', S'')$ connected by an edge of the tree. 

Feng and Doolittle’s method can be used to turn the solution of the tree 
alignment problem, namely a labeling of the internal nodes of the given tree, 
into a multiple alignment of the leaves. Moreover, it is straightforward to upper 
bound the distance in this alignment of pairs that are not endpoints of a tree 
edge. In fact, denote by $d(S', S'', T)$ the length of the path in $T$ between two 
sequences $S'$ and $S''$. Then, by the triangle inequality we have that $d_{A(T)}(S', S'') \le 
d(S', S'', T)$. This inequality suggests that, given a tree with sequences at the 
leaves for which we want to minimize average pairwise distance in the resulting 
multiple alignment, a good labeling for the internal nodes is one which minimizes 
the total inter-leaf distance in the tree. This strategy is adopted in this work to 
obtain alignments of small SP value, as described in Sect. 2. 

1.5 GESTALT Program Suite 

In this paper we describe the program GESTALT (GEnomic sequences STeiner 
ALignmenT), which can be used for both TA and SP multiple alignments. 
GESTALT is in fact a program suite, including modules for computing LP-based 
lower bounds for TA and SP, and optimal alignments of two or three sequences. 

The main program takes as input a set $C = \{S_1, \ldots, S_n\}$ of $n$ sequences 
and possibly a tree $T$ of which $C$ are the leaves. If the phylogenetic tree is not 
available, the algorithm internally computes one, which is then used to find 
an alignment of small SP value. If the tree is given, then the TA objective is 
optimized¹. The output of the algorithm consists of a multiple alignment of 
the input sequences, plus some extra information, such as the Steiner sequences 
computed at the internal nodes of the phylogenetic tree. 

GESTALT is based on the ideas introduced by Ravi and Kececioglu in [IZj of 
using Steiner sequences of the leaves to label the internal nodes of the tree. While 
in their paper Ravi and Kececioglu show that if the tree is d-ary the method gives 
a approximation for TA, in our work we do not restrict the degree of each 
node to a constant. Therefore we do not have the same approximation guarantee. 
However, among all the labelings considered is included the best lifted labeling 
of PI| and therefore we still have a performance guarantee of 2 for the TA 
objective. As is typically the case, this bound turns out to be largely pessimistic 
and our computational results show that the algorithm performs much better in 
practice. 

The 2-approximation guarantee holds also for the SP alignments we output. 
Recall that we include, among all the labelings considered, one in which the 

¹ The choice of the objective in the presence or absence of the tree can also be 
user-specified. 
internal nodes of the tree are all labeled with any leaf S. For this particular 
labeling, the resulting tree is equivalent to a star centered at S, and as remarked 
before |3, the best star centered at a leaf gives a 2-approximation. 

2 Procedure Overview 

Our program is largely based on a heuristic procedure by Ravi and Kececioglu 
(ini) for solving the tree alignment problem. Their algorithm relies on labeling 
the internal nodes with Steiner sequences for subsets of p leaves, where p is a 
parameter. The procedure is divided in two phases. In the first phase a Steiner 
sequence is computed for every subset of $q \le p$ leaves, obtaining a set T of all 
such Steiner sequences. In the second phase, dynamic programming is used to 
compute the best labeling of the internal nodes among those in which only labels 
from T are allowed. 
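The second phase can be illustrated by a small dynamic program over a rooted tree. This is a sketch under stated assumptions, not the GESTALT code: the tree is given as a hypothetical `children` dictionary, every internal node shares one candidate list, unit-cost edit distance is used, and the objective is total tree length (the TA objective).

```python
def edit_distance(s, t):
    """Unit-cost edit distance, computed row by row."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j - 1] + (a != b), prev[j] + 1, cur[j - 1] + 1))
        prev = cur
    return prev[-1]


def best_labeling_cost(children, leaf_seq, root, candidates):
    """Minimum tree length when internal nodes take labels from `candidates`.

    children: internal node -> list of child nodes.
    leaf_seq: leaf node -> its fixed sequence.
    Returns (cost, label chosen at the root).
    """
    def table(v):
        # Maps each admissible label of v to the best cost of v's subtree.
        if v in leaf_seq:
            return {leaf_seq[v]: 0}
        child_tables = [table(ch) for ch in children[v]]
        result = {}
        for label in candidates:
            result[label] = sum(
                min(cost + edit_distance(label, child_label)
                    for child_label, cost in ct.items())
                for ct in child_tables
            )
        return result

    root_table = table(root)
    label = min(root_table, key=root_table.get)
    return root_table[label], label
```

Restricting `candidates` to the leaf sequences restricts the search to labelings by leaves; adding Steiner sequences of triples can only lower the optimum found.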

In this work, we have decided to solve the TA problem by employing Ravi 
and Kececioglu’s algorithm, with the following variants: (i) Because computing 
exact Steiner sequences is expensive, we have limited the size of the subsets 
for which a Steiner problem is solved to $p = 3$. (ii) In addition to Sankoff's 
exact algorithm for Steiner sequences, with complexity $O(l^3)$, we also use a 
heuristic algorithm, with average (empirical) complexity $O(l^2)$. (iii) We do not 
necessarily compute the Steiner sequences for all the $\binom{n}{3}$ possible triples of leaves, 
but provide alternate, heuristic methods of sampling significant triples. (iv) We 
also perform a final re-optimization step, as introduced by Sankoff et al ( EDI). 

Our program can be used also to optimize SP. In this case, we first compute a 
tree having the given sequences for leaves and then assign tentative labels to the 
internal nodes by using Steiner sequences, as for the TA objective. In choosing 
the best label at each node, however, we use dynamic programming to minimize 
the total leaf-to-leaf distance in the tree, which is an upper bound on the final 
SP score. A final reoptimization phase can be run to improve the alignment. 
The outline of our multiple alignment heuristic procedure is given below. 

1. Tree computation. 

— TA: none (the tree is given). 

— SP: We compute a phylogenetic tree having the given sequences as leaves 
- this is derived from a MST on the sequence graph. 

2. Solution of Steiner problems. We tentatively assign to each of the internal 
nodes of the phylogenetic tree a set of labels, given by the Steiner sequences 
of some subsets of the leaves. 

3. Optimal labeling by Dynamic Programming. We find for each internal node 
the best sequence among those in its set of possible labels. 

— TA: The objective is to minimize the total tree-length. 

— SP: The objective is to minimize the total leaf-to-leaf distance in the 
tree. 
4. Local re-optimization. 

— TA: At each node of degree three we replace the current sequence by 
the Steiner sequence of its neighbors. We iterate as long as there are 
improvements. 

— SP: (after step 5.) We iteratively break up the alignment into two sub- 
alignments that are then realigned optimally. The subalignments chosen 
have a large average difference in the current value versus the edit dis- 
tance. 

5. Final alignment by Feng and Doolittle. We compute a multiple alignment of 
all the resulting sequences (both leaves and internal nodes) by the progressive 
alignment method of Feng and Doolittle. 

We elaborate on some of these steps next. 



2.1 Tree Computation 

In order to derive a phylogenetic tree T relating a set of sequences when one is 
not input, we use a simple greedy approach. We start with T being a minimum 
cost spanning tree of the edit distance graph. Let (u, v) be the largest cost edge 
of T. Break up T by deleting edge (u, v) into two trees, $T_u$ containing u and $T_v$
containing v. Recursively, apply the same procedure to $T_u$ and $T_v$, obtaining two
new trees, $T_{u'}$ and $T_{v'}$, rooted at new nodes u' and v' respectively. Finally, join
these two subtrees by means of edges (u', w) and (v', w) to a new root node w,
thus obtaining the final phylogenetic tree. 
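The greedy construction above can be sketched as follows. This is only an illustration under stated assumptions: the distance matrix `dist`, the helper names `mst_edges`, `build_tree` and `phylo_tree`, the use of Prim's algorithm for the spanning tree, and the representation of the result as nested pairs are all choices made here, not details taken from the paper.

```python
def mst_edges(dist):
    """Prim's algorithm on a full distance matrix; returns (cost, u, v) edges."""
    n = len(dist)
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        cost, u, v = min(
            (dist[u][v], u, v)
            for u in in_tree
            for v in range(n)
            if v not in in_tree
        )
        edges.append((cost, u, v))
        in_tree.add(v)
    return edges


def build_tree(nodes, edges):
    """Split the tree on its largest edge and recurse on both sides."""
    if len(nodes) == 1:
        return next(iter(nodes))          # a single leaf
    heaviest = max(edges)                 # largest-cost edge (u, v)
    rest = [e for e in edges if e != heaviest]
    _, u, _ = heaviest
    side = {u}                            # nodes of the component T_u
    changed = True
    while changed:
        changed = False
        for _, a, b in rest:
            if (a in side) != (b in side):
                side.update((a, b))
                changed = True
    left = [e for e in rest if e[1] in side]
    right = [e for e in rest if e[1] not in side]
    # The returned pair plays the role of the new root w joining u' and v'.
    return (build_tree(side, left), build_tree(nodes - side, right))


def phylo_tree(dist):
    return build_tree(set(range(len(dist))), mst_edges(dist))
```

For four sequences whose pairwise distances behave like points 0, 1, 10, 11 on a line, the sketch groups the two close pairs first and joins them at the root.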



2.2 Solution of Steiner Problems 

Choice of Steiner Sequences Given a set of possible sequences (labels) for
each internal node of the tree, choosing the best label is done by dynamic
programming (described in 12., Ij) and is very fast in practice. On the other hand,
computing the labels is very expensive. Therefore once some labels have been
computed, it is convenient to store them at every internal node, i.e. all the nodes
will have the same set Q of labels. As previously noted, the labels allowed at the
internal nodes will only be Steiner sequences for some subsets of $q \le 3$ leaves.
When $q = 1$ or 2, a Steiner sequence is simply a leaf, so that it will always be
$Q = L \cup Q'$, where L is the set of leaf sequences and Q' is a set of Steiner
sequences for some triples of leaves.
Let us denote by $Y(S_i, S_j, S_k)$ a Steiner sequence for the triple $(S_i, S_j, S_k)$. We
allow three possibilities for Q':

— $Q' = \emptyset$. In this case the internal nodes are labeled with leaf sequences
only. This option results in the fastest running time, but may produce poor
final alignments, especially when the given sequences are very dissimilar. 
Note that among the alignments based on these labels are included all lifted 
alignments jzg for TA. Similarly, these labels contain also all star alignments 
for SP.

— $Q' = \{Y(S_i, S_j, S_k) : i < j < k\}$. This is computationally the most expensive
option, since it requires the solution of $\binom{n}{3}$ Steiner problems. On the other
hand, the larger set of possible labels at the internal nodes guarantees a
better value of the final alignment.

— Let $S_1, S_2, \ldots, S_n$ be the sequence of leaves as encountered by performing
a depth-first visit of the tree. Then,
$Q' = \{Y(S_j, S_k, S_h) : h = k + 1 = j + 2 \ \text{or}\ h = k + \Delta = j + 2\Delta\}$
for a fixed offset $\Delta$. The intention is to heuristically
obtain a uniform sampling by selecting triples of leaves from different
positions in an Euler tour of the tree. This option is quick (there are only
$O(n)$ such triples) but ensures that each sequence is included in some triples,
and that all the sequences are given the same representation in the samples. 



Exact Steiner Sequences Assume we are interested in finding a Steiner
sequence for three sequences $U_1$, $U_2$ and $U_3$. The dynamic programming procedure
computes the optimal alignment of the variable Steiner sequence and $U_1$, $U_2$ and
$U_3$. This is done backwards from the final column of the alignment, which will
be of the form $(x_1, x_2, x_3, y)'$, where each $x_i$ is either the last letter of the
sequence $U_i$ or a blank (but at least one $x_i$ must be nonblank), and y is any
nonblank letter of the alphabet $\Sigma$ (representing the letter in the Steiner
sequence being constructed). For any letter x, define $1 \cdot x = x$ and $0 \cdot x = -$. Let
$B^+ = \{0, 1\}^3 \setminus (0, 0, 0)$ be the set of nonnull binary 3-vectors and let $V(l_1, l_2, l_3)$
be the cost of an optimal Steiner sequence for the first $l_1$, $l_2$ and $l_3$ characters
respectively of $U_1$, $U_2$ and $U_3$. The recursive dynamic programming relation is
then

$$V(l_1, l_2, l_3) = \min_{b \in B^+} \left[ V(l_1 - b_1, l_2 - b_2, l_3 - b_3) + \min_{y \in \Sigma} \sum_{i=1}^{3} c(b_i \cdot U_i[l_i], y) \right]$$


The Steiner sequence is given, as customary in dynamic programming, by
backtracking through the values $V(l_1, l_2, l_3)$ along the path for an optimal
solution and listing the letters $y = \arg\min_{y \in \Sigma} \sum_{i=1}^{3} c(b_i \cdot U_i[l_i], y)$ which achieve the
minimum in the above expression. Note that the above recurrence requires time
and space complexity of $O(l^3)$, provided that for all triples $(x_1, x_2, x_3)$ of letters
or blanks the values $C(x_1, x_2, x_3) := \min_{y \in \Sigma} \sum_{i=1}^{3} c(x_i, y)$ have been computed in a preliminary
step and stored in a look-up table. In our implementation we have reduced the
space complexity to $O(l^2)$ for the matrix $V(i, j, k)$ using ideas from [21].
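As a rough illustration of the recurrence for $V(l_1, l_2, l_3)$, the following sketch computes only the optimal cost, not the Steiner sequence itself. It follows the text in letting y range over nonblank letters. The unit cost function `unit_cost`, the gap symbol `GAP` and the function name `steiner_cost` are assumptions made for the example; the look-up table and the space reduction mentioned above are not implemented.

```python
from itertools import product

GAP = "-"  # hypothetical symbol for the blank


def unit_cost(x, y):
    """Illustrative cost: 0 for a match, 1 for any mismatch or gap."""
    return 0 if x == y else 1


def steiner_cost(u1, u2, u3, alphabet, cost=unit_cost):
    """Cost of an optimal Steiner sequence for u1, u2, u3 (cubic-size table)."""
    seqs = (u1, u2, u3)
    lens = tuple(len(u) for u in seqs)
    # B+: all binary 3-vectors except (0, 0, 0)
    moves = [b for b in product((0, 1), repeat=3) if any(b)]
    V = {(0, 0, 0): 0}
    # product() enumerates cells in lexicographic order, so every
    # predecessor cell is filled before it is needed.
    for cell in product(*(range(n + 1) for n in lens)):
        if cell == (0, 0, 0):
            continue
        best = None
        for b in moves:
            prev = tuple(l - bi for l, bi in zip(cell, b))
            if min(prev) < 0:
                continue
            # b_i * U_i[l_i]: the letter if b_i = 1, a blank otherwise
            column = [u[l - 1] if bi else GAP for u, l, bi in zip(seqs, cell, b)]
            step = min(sum(cost(x, y) for x in column) for y in alphabet)
            total = V[prev] + step
            if best is None or total < best:
                best = total
        V[cell] = best
    return V[lens]
```

Recovering the sequence would require storing, for each cell, the move b and letter y that achieve the minimum, and backtracking from the last cell.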



Heuristic Steiner Sequences Computing exact Steiner sequences is very time 
consuming. For instance, the solution of a problem on sequences of about 200 
letters each takes roughly half a minute on a Pentium PC. Considering that for
aligning 10 sequences we may have to solve $\binom{10}{3} = 120$ such problems, we see that
speeding up the computation of Steiner sequences would be greatly beneficial.
Therefore, we have devised an alternative, heuristic way of computing Steiner
sequences which is extremely fast and turned out to be almost optimal in
extensive testing (see Sect. 3).

The idea is to first find all optimal alignments of two of the three sequences,
say $S_1$ and $S_2$. They correspond to all the shortest paths from $(0, 0)$ to $(|S_1|, |S_2|)$
in the $|S_1| \times |S_2|$ dynamic programming lattice used for the pairwise alignment,
and can be represented in a compact form as the subgraph of the lattice of
all the edges on some optimal path. Note that this subgraph is typically much
smaller than the whole lattice (empirically, $O(l)$ versus $O(l^2)$). Then, we perform
a graph-to-sequence alignment, i.e. we find the best completion of an optimal
alignment of $S_1$ and $S_2$ with $S_3$. In this case, “best” is taken with respect to the
Steiner objective. 

The value of the final solution may depend on the ordering of the sequences,
since $S_3$ is clearly used differently than $S_1$ and $S_2$. We have observed in our
experiments that choosing $S_1$ and $S_2$ to be the two closest sequences results
in the best Steiner sequences over the three possible choices. However, since 
the algorithm is very fast, we compute all three possibilities of first aligning 
together two sequences and then versus the third, and return the best solution 
found. We conclude this section by remarking that the computation of heuristic 
Steiner sequences takes on the average one second for sequences of length 200, 
while returning a solution whose value was never more than 2 % larger than the 
optimum in our extensive testing. 

2.3 Optimal Labeling by Dynamic Programming 

In this section we consider the problem of optimally assigning a sequence from 
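a given set Q to each internal node of the tree. Denote by $w_1, \ldots, w_t$ the nodes
which are immediate descendants of a node i. Let $V(i, S)$ be the optimal value
for the subtree rooted at i when node i is labeled with a sequence $S \in Q$. We
have the following dynamic programming recurrence:

$$V(i, S) = \sum_{j=1}^{t} \min_{S' \in Q} \left[ \lambda(i, w_j)\, d(S, S') + V(w_j, S') \right]$$

where, at a leaf, only the leaf's own sequence is an admissible label.
The coefficients $\lambda(i, w_j)$ allow us to distinguish between the two objective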

functions - TA and SP. For the TA objective, $V(i, S)$ represents the minimum
total length of the subtree, among the labelings that assign S to i. This is
obtained by setting all the $\lambda$ equal to 1. For the SP objective, we want to find
the labels which minimize the total leaf-to-leaf distance. For any edge (u, v) of
T, we set $\lambda(u, v)$ to be the number of pairs of leaves whose connecting path in
the tree goes through (u, v). This value, called the load of the edge, is equal
to $k(n - k)$, where k is the number of leaves on one shore of the cut identified
by (u, v). By using the loads, the total leaf-to-leaf distance can be rewritten
as $\sum_{S_i, S_j} d(S_i, S_j, T) = \sum_{(u,v) \in T} \lambda(u, v)\, d(L(u), L(v))$, where $L(u)$ and $L(v)$ are
the sequences labeling nodes u and v.
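The load identity can be checked on a small rooted tree. In the sketch below the tree encoding (`children` maps a node to its list of children), the edge lengths in `length`, and the function names are all illustrative assumptions.

```python
def edge_loads(children, root):
    """Return {(parent, child): k * (n - k)} for every tree edge."""
    leaves_below = {}

    def count(v):
        kids = children.get(v, [])
        leaves_below[v] = 1 if not kids else sum(count(w) for w in kids)
        return leaves_below[v]

    n = count(root)  # total number of leaves
    loads = {}
    for u, kids in children.items():
        for w in kids:
            k = leaves_below[w]        # leaves on the child's shore of the cut
            loads[(u, w)] = k * (n - k)
    return loads


def total_leaf_to_leaf(children, root, length):
    """Sum of all leaf-to-leaf path lengths, computed through the loads."""
    loads = edge_loads(children, root)
    return sum(loads[e] * length[e] for e in loads)
```

With three leaves a, b, c and edge lengths 1, 2, 3, 4 as in the test data, the pairwise path lengths are 6, 7 and 7, and the load-weighted sum gives the same total of 20.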

Using the above relation, first the value of each label at each node is
computed bottom-up, and later, proceeding top-down from the root, it is determined
which label to pick at each node for obtaining an optimal solution. The overall
complexity is $O(n|Q|^2)$, i.e. a very fast procedure.
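A minimal sketch of the two passes (bottom-up values, top-down choice) is given below. It assumes a rooted tree given as a `children` dictionary, fixed sequences at the leaves in `leaf_seq`, a common candidate list for internal nodes, and a caller-supplied distance `dist`; the edge coefficient defaults to 1, which corresponds to the TA objective. All names are hypothetical.

```python
def label_tree(children, root, leaf_seq, candidates, dist, weight=lambda u, w: 1):
    """Return (optimal value, {node: chosen label})."""
    V = {}

    def allowed(v):
        # A leaf keeps its own sequence; internal nodes choose from candidates.
        return [leaf_seq[v]] if v in leaf_seq else candidates

    def up(v):
        for w in children.get(v, []):
            up(w)
        for s in allowed(v):
            V[v, s] = sum(
                min(weight(v, w) * dist(s, s2) + V[w, s2] for s2 in allowed(w))
                for w in children.get(v, [])
            )

    labels = {}

    def down(v, s):
        labels[v] = s
        for w in children.get(v, []):
            best = min(allowed(w), key=lambda s2: weight(v, w) * dist(s, s2) + V[w, s2])
            down(w, best)

    up(root)
    root_label = min(allowed(root), key=lambda s: V[root, s])
    down(root, root_label)
    return V[root, root_label], labels


def hamming(a, b):
    """Toy distance for equal-length strings, used only in the example."""
    return sum(x != y for x, y in zip(a, b))
```

For the SP objective one would pass a `weight` function returning the load of each edge instead of the constant 1.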



2.4 Reoptimization 

The reoptimization for the TA objective is the same as in Sankoff et al. [20]. For
SP, however, we use a new approach. As in other works (e.g. 0) we repeatedly
break up the alignment into two pieces that are then realigned optimally via
the basic dynamic program for edit distance. The new idea lies in how these
alignments are chosen. Since for each pair of sequences in the same subalignment
the distance remains the same, the only improvement can be for sequences that
are in different subalignments. Let $\delta(S, S') = d_A(S, S') - d(S, S')$, where $d_A$ is
the distance induced by the current alignment and d is the edit distance. If $A_1$ and
$A_2$ are the subalignments, $\delta(A_1, A_2) = \sum_{S \in A_1, S' \in A_2} \delta(S, S')$ is the $\delta$-value of
the cut $(A_1, A_2)$ in the graph of all sequences, and $\delta(A_1, A_2)/(|A_1||A_2|)$ is a
per-sequence measure of how bad the alignment currently is versus the lower bound
given by the edit distance. Hence we want to reoptimize some cuts of high
(per-sequence) value, which we find through standard greedy heuristics. We have
different settings on how far the reoptimization phase can be pushed. In the
most expensive setting, for each pair (S, S') of sequences we find a large-value
cut separating them and realign it. We iterate as long as there are improvements.

3 Computational Experiences 

For our preliminary tests, we used two popular data sets. First, we obtained 
the sets of protein sequences of McClure m, used extensively to benchmark
programs guided by the SP objective. For the Tree Alignment problem, we have
used a famous instance by Sankoff et al. [20], used as a benchmark in im.

As for the cost matrix, in our experiments we have used a distance matrix
due to Taylor for amino acid sequences, and the matrix in Sankoff [20] for
DNA sequences. Our program also works with all the common score matrices 
(e.g. PAM, BLOSUM, etc). 

1. Lower Bounds. A unique feature of the GESTALT suite is a procedure 
to generate linear programming (LP) based lower bounds on the TA and SP 
objective values of the given instance by using the Steiner sequences for triples 
computed so far. We describe the LP for the TA problem. We use a nonnegative 
variable for the length of every edge of the tree, and the objective is to minimize 
the sum of lengths of all tree edges. A distance of d between a pair of leaves
$S_i$ and $S_j$ allows us to add the constraint that the sum of the values of the
edge lengths on the path between $S_i$ and $S_j$ in the tree must be at least d.
Similarly, given a value of $TA(i, j, k)$ for the minimum sum of the distances from
an optimal Steiner sequence for the triple $(S_i, S_j, S_k)$ to the three sequences
$S_i$, $S_j$ and $S_k$, we add the constraint that the sum of the lengths of all the edges
in the tree induced by the three leaves $S_i$, $S_j$ and $S_k$ must be at least $TA(i, j, k)$.
The objective function in the LP is to minimize the sum of the values of the 
edge variables. The set of constraints for distances between pairs of leaves was
experimented with in Cni, while the strengthening to triples gives better bounds
as reported below.

Table 1. Heuristic vs exact Steiner sequences. Times in seconds, Pentium 133 MHz

            tot   tot      relative error       time exact    time heuristic
instance    seqs  triples  avg    min    max    min    max    min    max
sank        9     84       0.003  0      0.02   15.8   41.0   0.6    1.9
mc582x6     6     20       0.004  0      0.01   52.3   75.6   0.5    3.0
mc586x6     6     20       0.007  0      0.017  17.8   42.5   0.6    2.1
mc587x6     6     20       0.01   0.003  0.019  29.2   71.9   0.8    2.7

For the SP objective for multiple alignment, a simple averaging argument
over Steiner triples yields a simple lower bound of $\sum_{i<j<k} SP(i, j, k) / (n - 2)$
for n sequences, where $SP(i, j, k)$ denotes the optimal sum-of-pairs value
for the triple $S_i$, $S_j$ and $S_k$. This may be further extended to an LP lower bound
with one nonnegative variable for the distance between every pair of sequences
in the multiple alignment. The constraints now require that for every triple
$S_i$, $S_j$, $S_k$ of distinct sequences, the sum of the values of the three variables
involving the three pairs from this triple must be at least $SP(i, j, k)$. The objective
is to minimize the sum of all the variables over all pairs of sequences.
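The averaging bound rests on the fact that every pair of sequences belongs to exactly n - 2 triples. A small sketch follows, in which the callable `sp_triple` stands in for the optimal sum-of-pairs value of a triple; both names are hypothetical and the toy data in the usage note are made up.

```python
from itertools import combinations


def sp_triples_lower_bound(n, sp_triple):
    """Averaging lower bound: sum of SP(i, j, k) over all triples, over n - 2."""
    total = sum(sp_triple(i, j, k) for i, j, k in combinations(range(n), 3))
    return total / (n - 2)
```

If, for instance, each triple value happens to equal the sum of its three pairwise distances, the bound collapses to the plain sum of pairwise distances, which is the trivial pairs bound.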

2. Steiner Sequences. First, we determined the quality of heuristic vs exact 
Steiner sequences. The results are reported in Table 1. For these tests, we used 
four data sets, i.e. the sequences from Sankoff and three sets of sequences from 
McClure. These sequences have between one hundred and two hundred letters 
each. For each set, we have computed for each triple the exact and heuristic 
Steiner sequences, and compared the relative errors. It should be noted that 
on these sequences, the heuristic is roughly thirty times faster than the exact 
procedure, while the average error is less than one percent. A striking result was 
that in 41 out of 84 triples for the sank instance, the heuristic solution was in 
fact optimal. 

3. Tree Alignment. A second experiment was performed to assess the
quality of the solution to the Tree Alignment problem, and the relative per- 
formance with different settings of the program. We have run GESTALT on 
Sankoff’s problem with all possible combinations of user choices. The results 
are reported in Table 2. Again, it should be noted that using heuristic Steiner 
sequences is greatly beneficial to the computing time, and, since the whole pro- 
cedure is heuristic in nature, can even lead to better solutions than the exact 
option. This is indeed the case here. 

In order to evaluate the quality of the results, we have computed the lower 
bound on the problem by using our LP module. The LP lower bound based 
on all the Steiner sequences of triples for the TA objective is 266.375, improving
over the best bound of 253.5 previously known m. The optimal lifted alignment
finds a value of 364, as also reported in mu. Using heuristic Steiner sequences,
we find a solution of value about 302 in about 7 minutes. Contrast this with
the best upper bound of 295.5 by Sankoff et al. [20]. Our improved lower bound
shows that Sankoff’s solution is within 11% of optimal.

Table 2. TA results on the instance sank. Times in seconds, Pentium PC

Triples  Steiner  Reopt  Value   Time
ALL      HEUR     EXACT  302     592
ALL      HEUR     HEUR   302.25  424
ALL      EXACT    EXACT  303.25  2802
SOME     EXACT    EXACT  304     493
ALL      EXACT    HEUR   304.25  2599
SOME     EXACT    HEUR   304.5   267
SOME     HEUR     EXACT  314     201
SOME     HEUR     HEUR   315.75  23
NONE     -        EXACT  320     152
NONE     -        HEUR   320.5   6
ALL      HEUR     NONE   322.25  298
ALL      EXACT    NONE   322.5   2387
SOME     EXACT    NONE   333.5   258
SOME     HEUR     NONE   333.75  15
NONE     -        NONE   364     1

4. Sum of Pairs. For the SP objective, we report some results for the 
McClure data sets (Table 3). For each problem, we have computed the trivial 
lower bound given by the sum of edit distances, and two lower bounds based on 
the optimal SP alignment of triples of sequences - one uses a simple averaging
argument (LB triples) and the other the solution to an LP relaxation (LB lp).
We ran GESTALT with heuristic Steiner sequences, sampling all triples. Our
solutions are in an interval of 2 to 9 percent from the lower bound. The table
also shows the effectiveness of local reoptimization. For comparison, we also
report the SP value of the star alignment (Gusfield [H]).



Table 3. SP lower and upper bounds for McClure data sets

Instance   LB pairs  LB triples  LB lp   Star align.  GESTALT  Err %  GESTALT+reop  Err %
mc582x6    25411     26056       26100   28444        27647    0.06   26963         0.03
mc586x6    25191     25979       26029   29307        28605    0.10   27498         0.05
mc587x6    29914     30802       30864   34085        34152    0.11   32664         0.05
mc582x10   70718     72274       72757   82011        77676    0.07   75131         0.03
mc586x10   81745     84211       84662   99140        97725    0.15   91754         0.08
mc587x10   95002     97889       98349   115918       110463   0.12   105806        0.07
mc582x12   98810     100720      101464  113328       105674   0.04   103803        0.02
mc586x12   116889    120409      121130  143792       139398   0.15   131980        0.08
mc587x12   140679    145043      145804  174270       164883   0.13   160256        0.09



Abstract. The problem considered is that of determining the number
of subsequences obtainable by deleting t symbols from a string of length 
n over an alphabet of size s. Recurrences are proven and solved for the 
maximum and average case values, and bounds on these values are ex- 
hibited. 



1 Problem Definition 

We prove bounds on the number of subsequences of a given length that a string 
on a fixed-size alphabet can have. Such bounds have been the basis for an efficient 
algorithm that reconstructs a binary string from knowledge of a sufficient number 
of its subsequences 0 . This research area is linked to applications of Levenshtein 
distance whose usage “plays the central role” in “the study of block codes capable 
of correcting substitution and synchronization errors” |2I • 

An L-string is a string over alphabet L, where $|L| = s$; $L_n$ denotes the set
of all length n L-strings. A series is a maximal run of identical symbols and
$\tau(X)$ denotes the number of series in string X. A subsequence Y of string X is
a string obtained by deleting 0 or more symbols from X, and X is said to be
a supersequence of Y. $D_t(X)$ denotes the set of subsequences of X that can be
obtained by deleting exactly t symbols from X.

Calabi (PJ, as cited in |2|) proved that a particular string form attains the 
maximum value of \Dt{X)\ and found an expression for the generating function 
of that maximum value. We present a direct alternative proof of the upper bound 
and prove a simple underlying recurrence. 

Levenshtein PI proved that, for any binary string X,
$\binom{\tau(X)-t+1}{t} \le |D_t(X)| \le \binom{\tau(X)+t-1}{t}$.
These bounds can be generalized to L-strings jS|. However, while the
upper bound is tight, the lower bound is not. We prove a tight lower bound.

Assuming that X is equally likely to be any string in $L_n$, we derive and solve a
recurrence on the average value of $|D_t(X)|$.

2 Upper Bounds for $|D_t(X)|$

We determine an upper bound on the number of subsequences obtainable by 
deleting t symbols from a string of length n over an alphabet of size s. 



M. Crochemore, M. Paterson (Eds.): CPM’99, LNCS 1645, pp. 115–122, 1999.
© Springer-Verlag Berlin Heidelberg 1999



116 Daniel S. Hirschberg 



Let $L = \{\sigma_1, \ldots, \sigma_s\}$, where the $\{\sigma_i\}$ are listed in some order. Let
$C_n = c_1 \ldots c_n$ be a string in $L_n$ where $c_i = \sigma_{1 + ((i-1) \bmod s)}$. Thus, $C_n$ has the symbols
of L in circular order, cycling as many times as needed.

Let $d_s(t, n)$ denote the number of subsequences obtainable by deleting
t symbols from $C_n$, where L has cardinality s.

Calabi (P, as cited in P) proved that, for all X in $L_n$, $|D_t(X)| \le d_s(t, n)$,
and that ds{t,n) is the coefficient of a;" in the generating function: (j){x) = 

Apparently, “the proof is rather involved” and was not
published. In our Theorem 1 we present a direct alternative proof of the upper
bound. In our Theorem 2 we prove a simple recurrence on $d_s(t, n)$.

We use $Q^a$ to denote the subset of strings in set Q that begin with symbol a.
For example, $D_t^b(X)$ denotes the set of subsequences in $D_t(X)$ that start with symbol
b. If Q and R are sets then we use $Q + R$ to denote $Q \cup R$ with the assertion
that Q and R are disjoint and, thus, $|Q + R| = |Q| + |R|$.

Lemma 1. For any L-string X, $D_t(X) = \sum_{\sigma \in L} D_t^{\sigma}(X)$.

Proof. The set of strings is partitioned into subsets organized by each string’s 
first symbol. □ 



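Lemma 2. For $s \ge 1$, $d_s(t-1, n-1) \le d_s(t, n)$.

Proof. $d_s(t-1, n-1)$ counts subsequences of length $n - t$ as does $d_s(t, n)$, but
of a smaller string. □

If Q is a set of L-strings and $\sigma \in L$ is a symbol then $\sigma Q$ denotes the set of
L-strings $\{\sigma q \mid q \in Q\}$.

Lemma 3. For $s \ge 1$, $d_s(t, n) = \sum_{i=1}^{s} d_s(t+1-i, n-i)$.

Proof. Let $C_n = c_1 \ldots c_n$ be a string in $L_n$, where $c_i = \sigma_{1 + ((i-1) \bmod s)}$. Thus,
$C_n$ has the symbols of L in circular order, beginning with $\sigma_1$. Using Lemma 1
we see that $D_t(C_n) = \sum_{i=1}^{s} D_t^{\sigma_i}(C_n) = \sum_{i=1}^{s} \sigma_i D_{t+1-i}(c_{i+1} \ldots c_n)$,
where each suffix $c_{i+1} \ldots c_n$ is again a string of length $n - i$ with the symbols of L
in circular order. The statement
of the lemma follows directly. □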



Theorem 1. For $s \ge 1$ and for any $X \in L_n$, $|D_t(X)| \le d_s(t, n)$.

Proof. By induction on n and $n - t$. The theorem is trivially true for $n < 1$ and
$n - t < 1$. Let $X = x_1 \ldots x_n$. Let $f_i$ be the smallest index j such that $x_j = \sigma_i$
(and $f_i$ is $n + 1$ if $\sigma_i$ does not appear in X), where the elements of L, $\{\sigma_i\}$, are
ordered by their first appearance in X, thereby ordering $f_i$ smallest to largest.
Consequently, $f_i \ge i$. We use $X[i : j]$ to denote the substring $x_i \ldots x_j$ of X.
Using Lemma 1 we have

$$D_t(X) = \sum_{i=1}^{s} D_t^{\sigma_i}(X) = \sum_{i=1}^{s} \sigma_i D_{t+1-f_i}(X[f_i + 1 : n]) .$$

Therefore,

$$\begin{aligned}
|D_t(X)| &= \sum_{i=1}^{s} |D_{t+1-f_i}(X[f_i + 1 : n])| \\
&\le \sum_{i=1}^{s} |D_{t+1-f_i}(C_{n-f_i})|, \text{ using the inductive hypothesis,} \\
&\le \sum_{i=1}^{s} |D_{t+1-i}(C_{n-i})|, \text{ because } f_i \ge i \text{ and by applying Lemma 2,} \\
&= d_s(t, n), \text{ by applying Lemma 3.}
\end{aligned}$$



□ 



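Theorem 2. For $0 \le t \le n$ and $s \ge 2$, $d_s(t, n) = \sum_{i=0}^{t} \binom{n-t}{i} d_{s-1}(t-i, t)$.

Proof. By induction on n and $n - t$.

For the basis when $n = 0$, t must be zero and it suffices to see that
$d_s(0, 0) = d_{s-1}(0, 0) = 1$.

For the basis when $n - t = 0$, since $\binom{0}{i}$ is zero unless $i = 0$, it suffices to see
that $d_s(n, n) = \sum_{i=0}^{n} \binom{0}{i} d_{s-1}(n-i, n) = d_{s-1}(n, n) = 1$.

For the induction, using the recurrence of Lemma 3,

$$d_s(t, n) = d_s(t, n-1) + \sum_{k=2}^{s} d_s(t+1-k, n-k) .$$

Let $r = \sum_{k=2}^{s} d_s(t+1-k, n-k)$. Then

$$\begin{aligned}
r &= \sum_{k=2}^{s} \sum_{i=0}^{t+1-k} \binom{n-t-1}{i} d_{s-1}(t+1-k-i, t+1-k), \text{ using the inductive hyp.,} \\
&= \sum_{j=1}^{s-1} \sum_{i=0}^{t-j} \binom{n-t-1}{i} d_{s-1}(t-j-i, t-j), \text{ by letting } j = k - 1, \\
&= \sum_{j=1}^{s-1} \sum_{i=0}^{t} \binom{n-t-1}{i} d_{s-1}(t-j-i, t-j), \text{ as } d_{s-1}(t-j-i, t-j) = 0 \text{ for } i > t - j, \\
&= \sum_{i=0}^{t} \binom{n-t-1}{i} \sum_{j=1}^{s-1} d_{s-1}(t-i-j, t-j) \\
&= \sum_{i=0}^{t} \binom{n-t-1}{i} d_{s-1}(t-i-1, t), \text{ by using Lemma 3,} \\
&= \sum_{j=1}^{t+1} \binom{n-t-1}{j-1} d_{s-1}(t-j, t), \text{ by letting } j = i + 1, \\
&= \sum_{j=0}^{t} \binom{n-t-1}{j-1} d_{s-1}(t-j, t),
\end{aligned}$$

changing j's range by noting that, when j is 0, $\binom{n-t-1}{j-1} = 0$, and that,
when j is $t + 1$, $d_{s-1}(t-j, t) = 0$.

Therefore,

$$\begin{aligned}
d_s(t, n) &= d_s(t, n-1) + r, \text{ using the recurrence of Lemma 3,} \\
&= \sum_{i=0}^{t} \left[ \binom{n-t-1}{i} + \binom{n-t-1}{i-1} \right] d_{s-1}(t-i, t), \text{ using the inductive hypothesis,} \\
&= \sum_{i=0}^{t} \binom{n-t}{i} d_{s-1}(t-i, t), \text{ using the binomial recurrence.}
\end{aligned}$$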



□ 



Corollary 1. For $0 \le t \le n$, $d_2(t, n) = \sum_{i=0}^{t} \binom{n-t}{i}$; for $0 \le t \le n$,
$d_3(t, n) = \sum_{i=0}^{t} \binom{n-t}{i} \sum_{j=0}^{t-i} \binom{i}{j}$.

Proof. This follows immediately from Theorem 2 and the fact that $d_1(t, n) = 1$
for $0 \le t \le n$. □
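The recurrence of Theorem 2 and the upper bound of Theorem 1 can be checked by brute force on short strings. The sketch below is only a sanity check under stated assumptions: it encodes the recurrence as $d_s(t, n) = \sum_{i=0}^{t} \binom{n-t}{i} d_{s-1}(t-i, t)$ with $d_1(t, n) = 1$, and the function names and the small lowercase alphabet are illustrative choices.

```python
from functools import lru_cache
from itertools import combinations
from math import comb


def deletions(x, t):
    """The set D_t(x): distinct strings obtained by deleting exactly t symbols."""
    keep = len(x) - t
    return {"".join(x[i] for i in idx) for idx in combinations(range(len(x)), keep)}


@lru_cache(maxsize=None)
def d(s, t, n):
    """d_s(t, n) computed through the recurrence over the alphabet size."""
    if t < 0 or t > n:
        return 0
    if s == 1:
        return 1
    return sum(comb(n - t, i) * d(s - 1, t - i, t) for i in range(t + 1))


def cyclic_string(s, n, alphabet="abcdefgh"):
    """The string C_n: the first s symbols repeated in circular order."""
    return "".join(alphabet[i % s] for i in range(n))
```

Comparing `len(deletions(cyclic_string(s, n), t))` with `d(s, t, n)` for small s, n and t exercises the recurrence, and enumerating all strings of length n checks that none exceeds the bound.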

Observations. By evaluating ds{t, n) and expressing its difference from (") as a 
power series, one can see that, for t > s,ds{t,n) = (") — s)!+0(n*“®). 

We note that the problem of calculating the number, $i_s(t, n)$, of supersequences
obtainable by inserting t symbols in a length n string X on an alphabet
of size s is much simpler, and is invariant over X. It is known 0 that, using
a binary alphabet, $i_2(t, n) = \sum_{k=0}^{t} \binom{n+t}{k}$, generalized to $s > 2$ p|.

It is easy to see that $i_s(t, n) = i_s(t, n-1) + (s-1)\, i_s(t-1, n)$, with boundary
conditions $i_s(0, n) = 1$ and $i_s(t, 0) = s^t$. (Let $X \in L_{n-1}$, $Y \in L_{n+t-1}$, and
$a, b \in L$. Then bY is a supersequence of aX if and only if either (1) $a = b$ and
Y is a supersequence of X, or (2) $a \ne b$ and Y is a supersequence of aX.) This
recurrence is solved by $i_s(t, n) = \sum_{k=0}^{t} \binom{n+t}{k} (s-1)^k$.
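The invariance over X and the solution of the recurrence can be checked numerically. The sketch below assumes the closed form $\sum_{k=0}^{t} \binom{n+t}{k}(s-1)^k$, which satisfies the recurrence and both boundary conditions; the function names are illustrative, and the brute-force count is only feasible for very short strings.

```python
from itertools import product
from math import comb


def is_subsequence(x, y):
    """True if x can be obtained from y by deleting symbols."""
    it = iter(y)
    return all(ch in it for ch in x)


def count_supersequences(x, t, alphabet):
    """Brute-force count of the length len(x)+t supersequences of x."""
    m = len(x) + t
    return sum(is_subsequence(x, "".join(y)) for y in product(alphabet, repeat=m))


def i_closed(s, t, n):
    """Assumed closed form for the number of supersequences i_s(t, n)."""
    return sum(comb(n + t, k) * (s - 1) ** k for k in range(t + 1))
```

Counting for both "ab" and "aa" over a binary alphabet gives the same number, illustrating that the count does not depend on the particular string.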

As we have just seen, the recurrence for supersequences is very simple and
intuitive. The recurrence of Theorem 2 for subsequences is simple but currently
lacks an intuitive explanation.



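3 A Lower Bound for $|D_t(X)|$

It was stated |fild| that, for any binary string X, $|D_t(X)| \ge \binom{\tau(X)-t+1}{t}$,
where $\tau(X)$ is the number of series in X. We
note that this bound is the same as $\binom{\tau(X)-t}{t} + \binom{\tau(X)-t}{t-1}$. We improve and
generalize this bound. We first need a few lemmas.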

Lemma 4. For any L-strings U, V and any $a \in L$, $|D_t(UV)| \le |D_t(UaV)|$.

Proof. For all subsequences u of U and v of V, $uv \in D_t(UV) \Rightarrow uav \in D_t(UaV)$
and $uv \ne u'v' \Rightarrow uav \ne u'av'$. □

Lemma 5. If X is an L-string such that $\tau(X) = n$ then there exists a string
$Y \in L_n$, with $\tau(Y) = n$, such that $|D_t(Y)| \le |D_t(X)|$.

Proof. Let Y be the length n L-string consisting of one symbol from each of the
series in X. String X can be obtained from Y by a sequence of symbol insertions.
The statement of the lemma then follows from repeated applications of Lemma 4. □

Lemma 6. If X is a string in $L_n$ such that $\tau(X) = n$ then $|D_t(X)| \ge d_2(t, n)$.

Proof. By induction on n and $n - t$. The lemma is trivially true for the base
cases, when $n < 2$ or $n - t < 2$. For the induction step, let $X = abY$, where
$a \ne b$ because each series in X has length 1. Then,

$$D_t(X) = D_t^a(X) + D_t^b(X) + \sum_{\sigma \ne a, b} D_t^{\sigma}(X) \supseteq D_t^a(X) + D_t^b(X) = a D_t(bY) + b D_{t-1}(Y) .$$

Using the inductive hypothesis and Lemma 3 we obtain

$$|D_t(X)| \ge |D_t(bY)| + |D_{t-1}(Y)| \ge d_2(t, n-1) + d_2(t-1, n-2) = d_2(t, n) .$$

□

Theorem 3. For any L-string X, $|D_t(X)| \ge \sum_{i=0}^{t} \binom{\tau(X)-t}{i}$. This bound is
tight.

Proof. Follows directly from Corollary 1 and Lemmas 5 and 6. □

4 The Average Number of Subsequences 

Under the assumption that all length n L-strings are equiprobable, the average
number of subsequences obtainable by deleting one symbol has been shown to be
$(n(s-1)+1)/s$ 0. We develop and solve a recurrence on the average number
of subsequences obtainable by deleting t symbols.

Let $G_t(n) = \sum_{X \in L_n} |D_t(X)|$ be the sum, over all strings in $L_n$, of the
number of subsequences obtainable by deleting t symbols. Similarly,
$G_t^a(n) = \sum_{X \in L_n} |D_t^a(X)|$ is the sum when the subsequences are restricted to begin with
symbol a.

We see that, for $0 < t < n$,

$$G_t^a(n) = \sum_{X \in L_n} |D_t^a(X)| = \sum_{b \in L} \sum_{X \in L_n^b} |D_t^a(X)| . \qquad (1)$$

If $b = a$ then $\{X \in L_n^a\} = \{aY \mid Y \in L_{n-1}\}$. We note that the count of
subsequences of aY that start with a and have length $n - t$ is the same as
the count of subsequences of Y that have length $n - 1 - t$ because of a simple
bijection between those two sets of subsequences. As a result, we see that
$\sum_{X \in L_n^a} |D_t^a(X)| = \sum_{Y \in L_{n-1}} |D_t(Y)|$.

If $b \ne a$ then $\{X \in L_n^b\} = \{bY \mid Y \in L_{n-1}\}$. We note that the count of
subsequences of bY that start with a and have length $n - t$ is the same as the
count of subsequences of Y that start with a and have length $n - t$ because the
leading b of bY can just be discarded. As a result, we see that
$\sum_{X \in L_n^b} |D_t^a(X)| = \sum_{Y \in L_{n-1}} |D_{t-1}^a(Y)|$.

Therefore,

$$G_t^a(n) = \sum_{Y \in L_{n-1}} |D_t(Y)| + (s-1) \sum_{Y \in L_{n-1}} |D_{t-1}^a(Y)| . \qquad (2)$$

We then see that

$$G_t^a(n) = G_t(n-1) + (s-1)\, G_{t-1}^a(n-1) \qquad (3)$$

follows immediately from (1), (2) and the definitions.

From the fact that $G_t(n) = \sum_{a \in L} G_t^a(n)$, using (3) s times, once for each
symbol in L, we obtain

$$G_t(n) = s\, G_t(n-1) + (s-1)\, G_{t-1}(n-1) . \qquad (4)$$

Boundary conditions, $G_0(n) = G_n(n) = s^n$, hold because there is only one
string obtainable by deleting none or all of the symbols in each of the $s^n$ strings
in $L_n$.

Let $E_t(n)$ be the average (or expected) value of $|D_t(X)|$, where X can equally
likely be any string in $L_n$, and let $\lambda = 1 - 1/s$.

Theorem 4. For $0 < t < n$, $E_t(n) = E_t(n-1) + \lambda E_{t-1}(n-1)$, and $E_0(n) =
E_t(t) = 1$.

Proof. This follows from the recurrence (4) and boundary conditions on G and
the fact that $E_t(n) = G_t(n)/s^n$. □

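Theorem 5. $E_t(n) = \sum_{i=0}^{t} \binom{n-t-1+i}{i} \lambda^i$.

Proof. We use a generating function (see, for example, Liu jjj).

Let $F_n(x) = \sum_{t=0}^{\infty} E_t(n) x^t$. We note that $E_t(n) = 0$ if $t > n$, and that
$E_t(n) = 1$ if $t = n$ or $t = 0$. Then,

$$\begin{aligned}
F_n(x) - 1 - x^n &= \sum_{t=1}^{n-1} E_t(n) x^t \\
&= \sum_{t=1}^{n-1} E_t(n-1) x^t + \lambda \sum_{t=1}^{n-1} E_{t-1}(n-1) x^t \\
&= \sum_{t=1}^{n-1} E_t(n-1) x^t + \lambda x \sum_{t=0}^{n-2} E_t(n-1) x^t \\
&= [F_{n-1}(x) - E_0(n-1)] + \lambda [x F_{n-1}(x) - x^n E_{n-1}(n-1)] .
\end{aligned}$$

Therefore,

$$F_n(x) = (1 + \lambda x) F_{n-1}(x) + x^n (1 - \lambda) . \qquad (5)$$

By iterated expansion of (5), we obtain

$$\begin{aligned}
F_n(x) &= (1 + \lambda x)^n + \sum_{i=0}^{n-1} (1 + \lambda x)^i (1 - \lambda) x^{n-i} \\
&= \sum_{i=0}^{n} \binom{n}{i} \lambda^i x^i + \sum_{i=0}^{n-1} \sum_{j=0}^{i} \binom{i}{j} \lambda^j (1 - \lambda) x^{n+j-i} . \qquad (6)
\end{aligned}$$

In the expression (6) of $F_n(x)$, the coefficient of $x^t$ is $E_t(n) = \binom{n}{t} \lambda^t +
\sum_{i} \binom{i}{j} \lambda^j (1 - \lambda)$, where $j = i - (n - t) \ge 0$. Therefore we can restrict the
summation to $i \ge n - t$ and, letting $k = n - i$, we get

$$E_t(n) = \binom{n}{t} \lambda^t + \sum_{k=1}^{t} \binom{n-k}{t-k} \lambda^{t-k} (1 - \lambda) .$$

Noting that $\binom{n-k}{t-k} - \binom{n-k-1}{t-k-1} = \binom{n-k-1}{t-k}$, we finally obtain

$$\begin{aligned}
E_t(n) &= \binom{n}{t} \lambda^t + \sum_{k=1}^{t-1} \binom{n-k-1}{t-k} \lambda^{t-k} - \binom{n-1}{t-1} \lambda^t + \binom{n-t}{0} \\
&= \binom{n-1}{t} \lambda^t + \sum_{i=1}^{t-1} \binom{n-t-1+i}{i} \lambda^i + 1 .
\end{aligned}$$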

z^l 

□ 
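The average can be checked numerically for small parameters. The sketch below compares a brute-force average over all strings with the recurrence of Theorem 4 and with the closed form $\sum_{i=0}^{t} \binom{n-t-1+i}{i} \lambda^i$, which is the form assumed here for the solution; exact rational arithmetic avoids rounding issues. All function names are illustrative.

```python
from fractions import Fraction
from itertools import combinations, product
from math import comb


def num_deletions(x, t):
    """|D_t(x)| for a sequence x given as a tuple of symbols."""
    keep = len(x) - t
    return len({tuple(x[i] for i in idx) for idx in combinations(range(len(x)), keep)})


def average_brute(s, t, n):
    """Average of |D_t(X)| over all s**n strings of length n."""
    total = sum(num_deletions(x, t) for x in product(range(s), repeat=n))
    return Fraction(total, s ** n)


def average_recurrence(s, t, n):
    """E_t(n) from E_t(n) = E_t(n-1) + lam * E_{t-1}(n-1), E_0(n) = E_t(t) = 1."""
    lam = 1 - Fraction(1, s)
    if t == 0 or t == n:
        return Fraction(1)
    return average_recurrence(s, t, n - 1) + lam * average_recurrence(s, t - 1, n - 1)


def average_closed(s, t, n):
    """Assumed closed form; the case t = n is handled separately."""
    lam = 1 - Fraction(1, s)
    if t == n:
        return Fraction(1)
    return sum(comb(n - t - 1 + i, i) * lam ** i for i in range(t + 1))
```

For a binary alphabet, n = 4 and t = 1, all three functions agree with the value $(n(s-1)+1)/s = 5/2$ quoted at the start of this section.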

Observation. For t < nX/2, Et{n) > X)j=o — ■93-Et(n). 



Abstract. The study of approximately periodic strings is relevant to
diverse applications such as molecular biology, data compression, and 
computer-assisted music analysis. Here we study different forms of ap- 
proximate periodicity under a variety of distance rules. We consider three 
related problems, for two of which we derive polynomial-time algorithms; 
we then show that the third problem is NP-complete. 



1 Introduction 

Repetitive or periodic strings have been studied in such diverse fields as molecu- 
lar biology, data compression, and computer-assisted music analysis. In response 
to requirements arising out of a variety of applications, interest has arisen in algo- 
rithms for finding regularities in strings; that is, periodicities of an approximate 
nature. Some important regularities that have been studied in the literature are 
the following: 

— Periods: A string p is called a period of a string x if x can be written as
$x = p^k p'$ where $k \ge 1$ and p' is a prefix of p. The shortest period of x is
called the period of x. For example, if x = abcabcab, then abc, abcabc, and x
are periods of x, while abc is the period of x. If x has a period p such that
$|p| \le |x|/2$, then x is said to be periodic. Further, if setting $x = p^k$ implies
$k = 1$, x is said to be primitive; if $k \ge 2$, $p^k$ is called a repetition.

— Covers: A string w is called a cover of x if x can be constructed by
concatenations and superpositions of w. For example, if x = ababaaba, then aba and
x are the covers of x. If x has a cover $w \ne x$, x is said to be quasiperiodic;
otherwise, x is superprimitive.

— Seeds: A substring w of x is called a seed of x if it is a cover of some superstring
of x. For example, aba and ababa are some seeds of x = ababaab.

— Repetitions: A substring w of x that is a repetition is called a repetition
or tandem repeat in x. For example, if x = aababab, then aa and ababab are
repetitions in x; in particular, $a^2 = aa$ is called a square and $(ab)^3 = ababab$
is called a cube.
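The definitions of periods and covers can be made concrete with a direct, quadratic-time check. The sketch below is purely illustrative and is not one of the linear-time algorithms from the literature; the function names are chosen here.

```python
def periods(x):
    """All periods of x: prefixes p with x = p^k p', p' a prefix of p."""
    n = len(x)
    return [x[:m] for m in range(1, n + 1)
            if all(x[i] == x[i - m] for i in range(m, n))]


def covers(x):
    """All covers of x: prefixes w whose occurrences overlay every position."""
    n = len(x)
    result = []
    for m in range(1, n + 1):
        w = x[:m]
        covered_to = 0                 # positions [0, covered_to) are covered
        for i in range(n - m + 1):
            if i > covered_to:         # a gap: position covered_to is uncovered
                break
            if x.startswith(w, i):
                covered_to = i + m
        if covered_to == n:
            result.append(w)
    return result
```

On the examples from the list above, `periods("abcabcab")` yields abc, abcabc and the whole string, and `covers("ababaaba")` yields aba and the whole string.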

The notions cover and seed are generalizations of periods in the sense that 
superpositions as well as concatenations are used to define them. A significant 
amount of research has been done on each of these four notions: 

— Periods: The preprocessing of the Knuth-Morris-Pratt algorithm m finds
all periods of x in linear time (in fact, all periods of every prefix of x). In
parallel computation, Apostolico, Breslauer and Galil |2j gave an optimal
O(log log n) time algorithm for finding all periods, where n is the length of x.

— Covers: Apostolico, Farach and Iliopoulos 0 introduced the notion of covers
and described a linear-time algorithm to test whether x is superprimitive
or not (see also |YIXH Y]). Moore and Smyth m and recently Li and
Smyth m gave linear-time algorithms for finding all covers of x. In parallel
computation, Iliopoulos and Park m obtained an optimal O(log log n)
time algorithm for finding all covers of x. Apostolico and Ehrenfeucht 0
and Iliopoulos and Mouchard HE] considered the problem of finding maximal
quasiperiodic substrings of x. A two-dimensional variant of the covering
problem was studied in j1 1 1I Sj , and minimum covering by substrings of given
length in m.

— Seeds: Iliopoulos, Moore and Park M introduced the notion of seeds and
gave an O(n log n) time algorithm for computing all seeds of x. For the same
problem Berkman, Iliopoulos and Park |0j presented a parallel algorithm
that requires O(log n) time and O(n log n) work.

— Repetitions: There are several O(n log n) time algorithms for finding all
the repetitions in a string mm. In parallel computation, Apostolico and
Breslauer |P gave an optimal O(log log n) time algorithm (i.e., total work is
O(n log n)) for finding all the repetitions.

A natural extension of the repetition problems is to allow errors. Approximate
repetitions are common in applications such as molecular biology and
computer-assisted music analysis m. Among the four notions above, only
approximate repetitions have been studied. If x = uww'v where w and w' are
similar, ww' is called an approximate square or approximate tandem repeat.
When there is a nonempty string y between w and w', we say that w and
w' are an approximate nontandem repeat. Landau and Schmidt gave an
O(kn log k log n) time algorithm for finding repeated patterns whose edit distance
is at most k in a text of length n. Schmidt also gave an $O(n^2 \log n)$ algorithm
for finding approximate tandem or nontandem repeats in [23], which uses an
arbitrary score for similarity of repeated strings.

In this paper, we introduce the notion of approximate periods, which can
be considered as an approximate version of the three notions periods, covers, and
seeds. Here we study different forms of approximate periodicity under a variety
of distance rules. We consider three related problems, for two of which we
derive polynomial-time algorithms; we then show that the third problem is
NP-complete.

2 Preliminaries 

A string is a sequence of zero or more characters from an alphabet Σ. The set
of all strings over the alphabet Σ is denoted by Σ*. The empty string is denoted
by ε. The ith character of a string x is denoted by x[i]. A substring of x that
starts at position i and ends at position j is denoted by x[i..j].

A string w is a prefix of x if x = wu for u ∈ Σ*. Similarly, w is a suffix of x
if x = uw for u ∈ Σ*. A string w is a subsequence of x (or x is a supersequence of
w) if w is obtained by deleting zero or more characters (at any positions) from
x. For example, ace is a subsequence of aabcdef.

2.1 Measures 

Absolute measures. To measure the similarity (or distance) between two
strings, the Hamming distance and the edit distance are widely used. The
Hamming distance between two strings x and y is defined to be the smallest
number of change operations to convert x to y. The edit distance is defined to
be the smallest number of change, insert, and delete operations to convert x
to y. In more general cases, especially in molecular biology, a penalty matrix
is used. A penalty matrix specifies the substitution cost for each pair of
characters and the insertion/deletion cost for each character. An arbitrary
penalty matrix can also be used as a relative measure because it can contain
both positive and negative costs. It is common to assume that a penalty
matrix satisfies the triangle inequality.

Relative measures. When we want to compare the similarity between x and
y and the similarity between x' and y', we need relative measures (rather
than absolute measures) because the lengths of the strings x, y, x', y' may be
different. There are two ways to define relative measures between x and y:

— First, we can fix one of the two strings and define a relative measure with
respect to the fixed string. The error ratio with respect to x is defined to
be t/|x|, where t is an absolute measure between x and y.

— Second, we can define a relative measure symmetrically. The symmetric
error ratio is defined to be t/l, where t is an absolute measure between
x and y, and l = (|x| + |y|)/2. Note that we may instead take l = |x| + |y|
(then everything is the same except that every ratio is scaled by the
constant factor 1/2).
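
The measures above can be made concrete with a short Python sketch. It assumes unit costs for change, insert, and delete operations; the function names are illustrative and do not come from the paper.

```python
def hamming(x: str, y: str) -> int:
    """Number of change operations converting x to y (equal lengths required)."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(x, y))


def edit_distance(x: str, y: str) -> int:
    """Smallest number of change, insert, and delete operations converting x to y."""
    prev = list(range(len(y) + 1))          # distances from the empty prefix of x
    for i, a in enumerate(x, 1):
        cur = [i]
        for j, b in enumerate(y, 1):
            cur.append(min(prev[j] + 1,             # delete a
                           cur[j - 1] + 1,          # insert b
                           prev[j - 1] + (a != b))) # change a to b (or match)
        prev = cur
    return prev[-1]


def error_ratio(x: str, y: str) -> float:
    """Relative measure with respect to the fixed string x: t / |x|."""
    return edit_distance(x, y) / len(x)


def symmetric_error_ratio(x: str, y: str) -> float:
    """Symmetric relative measure: t / l with l = (|x| + |y|) / 2."""
    return edit_distance(x, y) / ((len(x) + len(y)) / 2)
```

For instance, with x = ab and y = abab the edit distance is 2, so the error ratio with respect to x is 1 and the symmetric error ratio is 2/3.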

3 Problem Definitions 

Given two strings x and p, we define approximate periods as follows. If there
exists a partition of x into disjoint blocks of substrings, i.e., x = p_1 p_2 ··· p_r
(p_i ≠ ε), such that the distance between p and p_i for every 1 ≤ i ≤ r is less
than or equal to t, we say that p is a t-approximate period of x (or p is an
approximate period of x with distance t). Each p_i, 1 ≤ i ≤ r, will be called
a partition block of x. Note that there can be several versions of approximate
periods according to the definition of distance. This definition of approximate
periods can be considered as an approximate version of the three notions periods,
covers, and seeds discussed above, because

(i) superpositions in defining covers and seeds and 

(ii) extra characters at the ends of a given string in defining periods and seeds 

can be accounted for to some degree when we use edit distances for the measure.
Of course, if we allowed overlaps between the p_i's, we could extend the
definition of an approximate period, but this would merely increase the
complexity of the problems of finding approximate periods.

We consider the following problems related to approximate periods. 

Problem 1. Given x and p, find the minimum t such that p is a t-approximate 
period of x. 

Since p is fixed in this case, it makes no difference whether we use the 
absolute Hamming (or edit) distance or the error ratio with respect to p. We can 
also use a penalty matrix for the measure. If a threshold k on the edit distance
is given as input in Problem 1, the problem asks whether p is a k-approximate
period of x or not.

Problem 2. Given a string x, find a substring p of x that is an approximate 
period of x with the minimum distance. 

Since the length of p is not (a priori) fixed in this problem, we need to use 
relative measures (i.e., error ratios or penalty matrices) rather than absolute 
measures. 

Problem 3. Given a string x, find a string p that is an approximate period of x 
with the minimum distance. 

This problem is harder than Problem 2 because p can be any string, not 
necessarily a substring of x. 



4 Algorithms and NP-Completeness 



We use arbitrary penalty matrices as the basic measure of similarity in
each problem. Recall that a penalty matrix defines the substitution cost for each
pair of characters and the insertion or deletion cost for each character.

4.1 Problem 1

Our algorithm for Problem 1 consists of two steps. Let n = |x| and m = |p|.

1. Compute the distance between p and every substring of x.

2. Compute the minimum t such that p is a t-approximate period of x. We use
dynamic programming to compute t. Let w_{i,j} be the distance between p and
x[i..j]. These values w_{i,j} are obtained from the first step. Let t_i be the
minimum value such that p is a t_i-approximate period of x[1..i]. Let t_0 = 0.
For i = 1 to n, we compute t_i by the following formula:

    t_i = min_{0 ≤ h < i} max(t_h, w_{h+1,i}).

The value t_n is the minimum t such that p is a t-approximate period of x.
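
The two steps can be sketched in Python as follows. This is a deliberately naive version under our own assumptions: step 1 simply calls a unit-cost edit distance for every substring rather than sharing D tables, and the names `edit_distance` and `min_approx_period_distance` are hypothetical.

```python
from math import inf


def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance, used here as a stand-in for step 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def min_approx_period_distance(x: str, p: str):
    """Smallest t such that p is a t-approximate period of x."""
    n = len(x)
    # Step 1: w[h, i] is the distance between p and the block x[h+1..i]
    # (1-based positions as in the text), for 0 <= h < i <= n.
    w = {(h, i): edit_distance(p, x[h:i]) for i in range(1, n + 1) for h in range(i)}
    # Step 2: t[i] = min over 0 <= h < i of max(t[h], w[h+1..i]), with t[0] = 0.
    t = [0] + [inf] * n
    for i in range(1, n + 1):
        t[i] = min(max(t[h], w[h, i]) for h in range(i))
    return t[n]


print(min_approx_period_distance("abaab", "ab"))   # 1, e.g. via blocks ab | a | ab
```

The restriction h < i guarantees that every partition block is nonempty, as the definition requires.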

To compute the distances in step 1, we use the dynamic programming table
called the D table. To compute the distance between two strings x and y, a D
table of size (|x| + 1) × (|y| + 1) is used. Each entry D[i,j] (0 ≤ i ≤ |x|,
0 ≤ j ≤ |y|) stores the minimum cost of transforming x[1..i] to y[1..j].
Initially, D[0,0] = 0, D[i,0] = D[i−1,0] + δ(x[i],Δ), and
D[0,j] = D[0,j−1] + δ(Δ,y[j]). Then we can compute all the entries of the
D table in O(|x||y|) time by the following recurrence:

    D[i,j] = min { D[i−1,j] + δ(x[i],Δ),
                   D[i,j−1] + δ(Δ,y[j]),
                   D[i−1,j−1] + δ(x[i],y[j]) }

where δ(a,b) is the cost of transforming the character a to b. (Δ is a space, so
δ(a,Δ) means the deletion cost of a and δ(Δ,a) means the insertion cost of a.)
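
A direct transcription of this recurrence into Python might look as follows. The penalty matrix is modelled as a callable `delta(a, b)`, and the space Δ is represented by a `gap` symbol that is assumed not to occur in the strings; both choices are ours.

```python
def d_table(x: str, y: str, delta, gap: str = "-"):
    """Fill the (|x|+1) x (|y|+1) D table; D[i][j] = min cost of x[1..i] -> y[1..j]."""
    rows, cols = len(x) + 1, len(y) + 1
    D = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):                       # delete the first i characters of x
        D[i][0] = D[i - 1][0] + delta(x[i - 1], gap)
    for j in range(1, cols):                       # insert the first j characters of y
        D[0][j] = D[0][j - 1] + delta(gap, y[j - 1])
    for i in range(1, rows):
        for j in range(1, cols):
            D[i][j] = min(D[i - 1][j] + delta(x[i - 1], gap),        # deletion
                          D[i][j - 1] + delta(gap, y[j - 1]),        # insertion
                          D[i - 1][j - 1] + delta(x[i - 1], y[j - 1]))  # substitution
    return D


def unit_cost(a: str, b: str) -> int:
    """Edit-distance penalties: 0 for a match, 1 for any other operation."""
    return 0 if a == b else 1


print(d_table("kitten", "sitting", unit_cost)[-1][-1])   # 3
```

Note that the last row of the table gives the distance between x and every prefix of y, which is what allows one table per start position to serve many substrings at once.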

Theorem 1. Problem 1 can be solved in O(mn^2) time when an arbitrary penalty
matrix is used for the measure of similarity. If the edit distance (resp. the
Hamming distance) is used for the measure, it can be solved in O(mn) time
(resp. in O(n) time).

Proof. For an arbitrary penalty matrix, step 1 takes O(mn^2) time since we make
a D table of size m × (n − i + 1) for each position i of x. In step 2, we can compute
the minimum t in O(n^2) time since we compare O(n) values at each position of
x. Thus, the total time complexity is O(mn^2).

When the edit distance is used for the measure of similarity, this algorithm
for Problem 1 can be improved. In this case, δ(a,b) = 1 if a ≠ b and δ(a,b) = 0
otherwise. Now it is not necessary to compute the edit distances between p
and the substrings of x whose lengths are larger than 2m because their edit
distances with p will exceed m. (It is trivially true that p is an m-approximate
period of x.) Step 1 now takes O(m^2 n) time since we make a D table of size
m × 2m for each position of x. Also, step 2 can be done in O(mn) time since we
compare O(m) values at each position of x. Thus the time complexity is reduced
to O(m^2 n).
However, we can do better. Step 1 can be solved in O(mn) time by the
algorithm due to Landau, Myers, and Schmidt [20]. Given two strings x and y
and a forward (resp. backward) solution for the comparison between x and y,
the algorithm in [20] incrementally computes a solution for x and by (resp. yb)
in O(k) time, where b is an additional character and k is a threshold on the edit
distance. This can be done due to the relationship between the solution for x
and y and the solution for x and by. When k = m (i.e., the threshold is not
given), we can compute all the edit distances between p and every substring of x
whose length is at most 2m in O(mn) time using this algorithm. Therefore, we
can solve Problem 1 in O(mn) time if the edit distance is used for the measure
of similarity.

If we use the Hamming distance for the measure, it trivially takes O(n) time
since x must be partitioned into blocks of size m. □
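
The Hamming case of the proof is simple enough to sketch directly. The Python function below is our own illustration; it assumes that when |p| does not divide |x| there is no valid partition, which it signals by returning None.

```python
def min_hamming_period_distance(x: str, p: str):
    """Smallest t such that p is a t-approximate period of x under Hamming distance.

    Every block must have length m = |p|, so the partition is forced and a single
    left-to-right scan of x suffices, giving O(n) time.
    """
    n, m = len(x), len(p)
    if m == 0 or n % m != 0:
        return None                      # x cannot be cut into blocks of size m
    worst = 0
    for start in range(0, n, m):
        block = x[start:start + m]
        worst = max(worst, sum(a != b for a, b in zip(block, p)))
    return worst


print(min_hamming_period_distance("abcabdabc", "abc"))   # 1
```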

When the threshold k on the edit distance is given as input for Problem 1,
it can be solved in O(kn) time because each step of the above algorithm takes
O(kn) time.



4.2 Problem 2 

Let p be a candidate string for the approximate period of x. If the Hamming (or
edit) distance is used for Problem 2, we need to use relative measures because
the length of p varies. (If the absolute Hamming or edit distance is used, every
substring of x of length 1 is a 1-approximate period of x.) We can use the error
ratio t/l for the measure of similarity, where t is the Hamming (or edit) distance
between the two strings and l is either the average length of the two strings
(symmetric error ratio) or the length of p (error ratio with respect to p).

When the relative edit distance is used for the measure of similarity, Problem
2 can be solved in O(n^4) time by our algorithm for Problem 1. If we take each
substring of x as p and apply the O(mn) algorithm for Problem 1 (that uses
the algorithm in [20]), it takes O(|p|n) time for each p. Since there are O(n^2)
substrings of x, the overall time is O(n^4).

Without using the somewhat complicated algorithm in [20], however, we can
solve Problem 2 in O(n^4) time by the following simple algorithm for arbitrary
penalty matrices.

Let R be the minimum distance so far. Initially, R = ∞. For i = 1 to n,
we do the following. For each i, we process the n − i + 1 substrings that start at
position i. Let m be the length of a chosen substring of x as p. Let m = 1.

1. Take x[i..i + m − 1] as p and compute the distance between p and every
substring of x. This can be done by making n D tables with p and each of
n suffixes of x. By adding just one row to each of the previous D tables (i.e., the
n D tables for p = x[i..i + m − 2]), we can compute these new D tables in
O(n^2) time. (Note that when m = 1, we create new D tables.)

2. Compute the minimum distance t such that p is a t-approximate period of
x. This step is similar to the second step of the algorithm for Problem 1. Let
w_{h,j} be the distance between p and x[h..j], which is obtained from step 1. Let
t_j be the minimum value such that p is a t_j-approximate period of x[1..j],
and let t_0 = 0. For j = 1 to n, we compute t_j by the following formula:

    t_j = min_{0 ≤ h < j} max(t_h, w_{h+1,j}).

The value t_n is the minimum t such that p is a t-approximate period of x. If
t is smaller than R, we update R with t. If m < n − i + 1, increase m by 1
and go to step 1.

When all the steps are completed, the final value of R is the minimum distance
and a substring that is an R-approximate period of x is an answer to Problem 2.
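
The procedure above can be sketched in Python under a few assumptions of ours: unit edit costs instead of a general penalty matrix, the error ratio with respect to p as the relative measure, and a hypothetical parameter `max_len` bounding |p| (without such a bound the candidate p = x trivially has distance 0). The sketch keeps one D-table row per suffix of x and adds one row whenever p grows by a character, as in step 1.

```python
from fractions import Fraction
from math import inf


def best_approximate_period(x: str, max_len: int):
    """Return (ratio, p) minimizing t/|p| over substrings p of x with |p| <= max_len."""
    n = len(x)
    best_ratio, best_p = inf, None
    for i in range(n):                               # start position of candidate p
        # rows[h][k] = edit distance between the current p and x[h:h+k];
        # p starts out empty, so the distance is simply k.
        rows = [list(range(n - h + 1)) for h in range(n)]
        for m in range(1, min(max_len, n - i) + 1):
            c = x[i + m - 1]                         # p grows by one character
            for h in range(n):                       # step 1: add one row per D table
                old = rows[h]
                new = [old[0] + 1]
                for k in range(1, n - h + 1):
                    new.append(min(old[k] + 1, new[k - 1] + 1,
                                   old[k - 1] + (c != x[h + k - 1])))
                rows[h] = new
            # Step 2: minimax partition, with w_{h+1,j} = rows[h][j - h].
            t = [0] + [inf] * n
            for j in range(1, n + 1):
                t[j] = min(max(t[h], rows[h][j - h]) for h in range(j))
            ratio = Fraction(t[n], m)                # error ratio with respect to p
            if ratio < best_ratio:
                best_ratio, best_p = ratio, x[i:i + m]
    return best_ratio, best_p


print(best_approximate_period("abcabdabc", 3))       # (Fraction(1, 3), 'abc')
```

Each extension of p costs O(n^2) for both steps, so with max_len = n the sketch matches the O(n^4) bound discussed in the text.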

Theorem 2. Problem 2 can be solved in O(n^4) time when an arbitrary penalty
matrix is used for the measure of similarity. If the Hamming distance is used for
the measure, it can be solved in O(n^3) time.

Proof. For an arbitrary penalty matrix, we make n D tables in O(n^2) time in
step 1 and compute the minimum distance in O(n^2) time in step 2. For m = 1
to n − i + 1, we repeat the two steps. Therefore, it takes O(n^3) time for each
i and the total time complexity of this algorithm is O(n^4). If the relative edit
distance is used, this algorithm can be slightly simplified as in Problem 1, but
it still takes O(n^4) time.

If the relative Hamming distance is used for the measure, Problem 2 can be
solved in O(n^3) time because there are O(n^2) candidates for p and O(n) time is
required for each candidate. □

4.3 Problem 3 

Given a set of strings, the shortest common supersequence (SCS) problem is
to find a shortest common supersequence of all strings in the set. The SCS
problem is NP-complete. We will show that Problem 3 is NP-complete
by a reduction from the SCS problem. In this section we will call Problem 3 the
AP problem (abbreviation of the approximate period problem).

The decision versions of the SCS and AP problems are as follows: 

Definition 1. Given a positive integer m and a finite set S of strings from Σ*
where Σ is a finite alphabet, the SCS problem is to decide if there exists a string
w with |w| ≤ m such that w is a supersequence of each string in S.
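
Verifying a proposed answer to this decision problem is straightforward, as the following Python sketch suggests (the function names are ours). It checks that a candidate w has length at most m and is a supersequence of every string in S.

```python
def is_supersequence(w: str, s: str) -> bool:
    """True if s can be obtained from w by deleting zero or more characters."""
    remaining = iter(w)
    # `ch in remaining` advances the iterator until ch is found, so the
    # characters of s are matched in w from left to right.
    return all(ch in remaining for ch in s)


def is_common_supersequence(w: str, strings, m: int) -> bool:
    """Check a certificate w for the SCS decision problem with bound m."""
    return len(w) <= m and all(is_supersequence(w, s) for s in strings)


print(is_supersequence("aabcdef", "ace"))                      # True
print(is_common_supersequence("0101", ["01", "10", "11"], 4))  # True
```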

Definition 2. Given a number t, a string x from (Σ')* where Σ' is a finite
alphabet, and a penalty matrix, the AP problem is to decide if there exists a
string u such that u is a t-approximate period of x.



Now we transform an instance of the SCS problem to an instance of the AP
problem. We can assume that Σ = {0, 1} since the SCS problem is NP-complete even
if Σ = {0, 1}. First, we set Σ' = Σ ∪ {a, b, #, $, *_1, *_2, Δ}. Assume that
there are n strings s_1, ..., s_n in S. Let x = #s_1$#s_2$ ··· #s_n$#*_1^m$#*_2^m$.
Then, set t = m and define the penalty matrix as in Figure 1, where a shaded
entry can be any value greater than m. It is easy to see that this transformation
can be done in polynomial time. Note that the penalty matrix M is a metric.

Fig. 1. The penalty matrix M

Lemma 1. Assume that x is constructed as above. If u is an m-approximate
period of x, then u is of the form #α$ where α ∈ {a, b}^m.

Proof. We first show that u must have one # and one $.

1. Suppose that u has no # (resp. $). Clearly, there exists a partition block of
x which has at least one # (resp. $), and the distance between u and the
block is greater than m. Therefore, u must have at least one # and at least
one $.

2. Suppose that u has more than one # (or $). Assume that u has two #'s.
(The other cases are similar.) Then u must also have two $'s because unless
the number of #'s equals that of $'s in u, at least one partition block of x
cannot have the same numbers of #'s and $'s as u. Consider the last
partition block of x. Since the last block must have two #'s and two $'s as
u, it contains #*_1^m$#*_2^m$. For the distance between u and the last block
of x to be at most m, u must have at least m characters from {*_1, *_2}. In
such cases, however, the distance between u and any other partition block
of x will exceed m.

It remains to show that u = #α$ where α ∈ {a, b}^m. Since u has one # and
one $, x must be partitioned just after every occurrence of $. Let u be of the
form β#α$γ, where β, α, γ ∈ {0, 1, a, b, *_1, *_2, Δ}*. Consider the last two blocks
#*_1^m$ and #*_2^m$ of x. If α contains i *_1's for i ≥ 1, α must also have i *_2's and
the remaining m − 2i characters in α must be from {a, b} so that the distances
between u and the last two blocks of x do not exceed m. However, this makes
the distance between u and any other partition block of x exceed m due to the *_1's
and *_2's in α. Hence α cannot have *_1 or *_2. Also, α cannot have any character
from {0, 1, Δ} since 0, 1 and Δ have cost 2 with *_1 and *_2 in the last two blocks
of x. For the distances between u and the last two blocks of x to be at most m,
β and γ must be empty and α must be in {a, b}^m. □



Theorem 3. The AP problem is NP-complete. 

Proof. It is easy to see that the AP problem is in NP. To show that the AP
problem is NP-complete, we need to show that S has a common supersequence
w such that |w| ≤ m if and only if there exists a string u such that u is an
m-approximate period of x.

(if) By Lemma 1, u = #α$ where α ∈ {a, b}^m. Since u is an m-approximate
period of x, the distance between u and each partition block #s_i$ is at most m.
(The distances between u and the last two blocks #*_1^m$ and #*_2^m$ are always
m.) Since |α| = m and the distance between α and s_i is at most m, each 0 (resp.
1) in s_i must be aligned with a (resp. b) in α. That is, each a (resp. b) in α must
be aligned with 0 (resp. 1) or Δ in s_i. If we substitute 0 for a and 1 for b in α,
we obtain a common supersequence w of s_1, ..., s_n such that |w| = m. (Note
that if an a or b in α is aligned with Δ for all s_i, we can delete the character in α
and obtain a common supersequence which is shorter than m.) A similar
alignment was used by Wang and Jiang [30].

(only if) Let s be a common supersequence of S such that |s| ≤ m. Let α be
the string constructed by substituting a for 0 and b for 1 in s. Partition x just
after every occurrence of $. The distance between each partition block of x and
#α$ is at most m since each a (resp. b) in α can be aligned with 0 (resp. 1), Δ,
*_1, or *_2 in each partition block. Therefore, #α$ is an m-approximate period of
x. □
Abstract. A pair in a string is the occurrence of the same substring
twice. A pair is maximal if the two occurrences of the substring cannot
be extended to the left and right without making them different. The gap
of a pair is the number of characters between the two occurrences of the
substring. In this paper we present methods for finding all maximal pairs
under various constraints on the gap. In a string of length n we can find
all maximal pairs with gap in an upper and lower bounded interval in
time O(n log n + z) where z is the number of reported pairs. If the upper
bound is removed the time reduces to O(n + z). Since a tandem repeat is
a pair where the gap is zero, our methods can be seen as a generalization
of finding tandem repeats. The running time of our methods equals the
running time of well known methods for finding tandem repeats.



1 Introduction 

A pair in a string is the occurrence of the same substring twice. A pair is left- 
maximal (right-maximal) if the characters to the immediate left (right) of the 
two occurrences of the substring are different. A pair is maximal if it is both 
left- and right-maximal. The gap of a pair is the number of characters between 
the two occurrences of the substring. For example, the two occurrences of the 
substring ma in the string maximal form a maximal pair of ma with gap two. 

Gusfield [Sect. 7.12.3] describes how to report all maximal pairs in a string
using the suffix tree of the string in time O(n + z) and space O(n), where n
is the length of the string and z is the number of reported pairs. Since there
is no restriction on the gap of the maximal pairs reported by this algorithm,
many of them probably describe occurrences of substrings that are overlapping
or far apart in the string. In many applications in computational biology this
is unfortunate, so several papers address the problem of finding occurrences of
similar substrings not too far apart.
number 20244 (ALCOM-IT).

M. Crochemore, M. Paterson (Eds.): CPM'99, LNCS 1645, pp. 134-149, 1999.
In this paper we will describe how to find all maximal pairs in a string with
gap in an upper and lower bounded interval in time O(n log n + z) and space O(n).
The interval of allowed gaps can be chosen such that we report a maximal pair
only if the gap is between constants c_1 and c_2, but more generally, it can be
chosen such that we report a maximal pair of α only if the gap is between g_1(|α|)
and g_2(|α|), where g_1 and g_2 are functions that can be computed in constant time.
This, for example, makes it possible to find all maximal pairs with gap between
zero and some fraction of the length of the repeated substring. If we remove the
upper bound on allowed gaps, and only require the gap of a reported pair of α
to be at least g_1(|α|), then the running time reduces to O(n + z). The methods
we present all use the suffix tree as the fundamental data structure combined
with efficient methods for merging search trees and heap-ordered trees.



The problem of finding occurrences of repeated substrings in a string is well
studied. Most of the work has been concerned with efficient methods for finding
occurrences of contiguously repeated substrings. An occurrence of a substring
of the form αα is called an occurrence of a square or a tandem repeat. Most
well-known methods for finding the occurrences of all tandem repeats in a string
require time O(n log n + z), where n is the length of the string and z is the
number of reported occurrences of tandem repeats. Work has also been
done on just detecting whether or not a string contains a tandem repeat.
Recently, building on an earlier idea, two methods have been presented
that find a compact representation of all tandem repeats in a string in
time O(n). Other papers consider the problem of finding occurrences of
contiguous repeats of substrings that are within some Hamming- or edit-distance
of each other.



In biological sequence analysis, searching for tandem repeats is used to
reveal structural and functional information [pp. 139-142], but searching for
exact tandem repeats can be too restrictive because of sequencing and other
experimental errors. By searching for maximal pairs with small gaps (maybe
depending on the length of the substring) it could be possible to compensate
for these errors. On the other hand, finding maximal pairs with a gap within
an interval can be seen as a generalization of finding occurrences of tandem
repeats. Stoye and Gusfield say that an occurrence of the tandem repeat αα
is a branching occurrence of the tandem repeat αα if and only if the characters
to the immediate right of the two occurrences of α are different, and they
explain how to deduce the occurrence of all tandem repeats in a string from the
occurrences of branching tandem repeats in time proportional to the number
of tandem repeats. Since a branching occurrence of a tandem repeat is just a
right-maximal pair with gap zero, the methods presented in this paper can be
used to find all tandem repeats in time O(n log n + z). This matches the time
bounds of previously published methods for this problem.






The rest of this paper is organized as follows. In Sect. 2 we define pairs and
suffix trees and describe how in general to find pairs using the suffix tree. In
Sect. 3 we present facts about efficient merging of search trees, and use them to
formulate methods for finding all maximal pairs in a string with gap in an upper
and lower bounded interval. In Sect. 4 we briefly discuss how to find all maximal
pairs in a string with gap in a lower bounded interval. Finally, in Sect. 5 we
summarize our work and discuss open problems.

2 Preliminaries 

Throughout this paper S will denote a string of length n over a finite alphabet Σ.
We will use S[i], for i = 1, 2, ..., n, to denote the ith character of S, and use
S[i .. j] as notation for the substring S[i]S[i + 1] ··· S[j] of S. To be able to refer
to the characters to the left and right of every character in S without worrying
about the first and last character, we define S[0] and S[n + 1] to be two distinct
characters not appearing anywhere else in S.

In order to formulate methods for finding repetitive structures in S, we need
a proper definition of such structures. An obvious definition is to find all pairs of
identical substrings in S. This, however, leads to a lot of redundant output, e.g.
in the string that consists of n identical characters there are Θ(n^3) such pairs. To
limit the redundancy without sacrificing any meaningful structures, Gusfield
defines maximal pairs.

Definition 1 (Pair). We say that (i, j, |α|) is a pair of α in S formed by i and j
if and only if 1 ≤ i < j ≤ n − |α| + 1 and α = S[i .. i + |α| − 1] = S[j .. j + |α| − 1].
The pair is left-maximal (right-maximal) if the characters to the immediate left
(right) of the two occurrences of α are different, i.e. left-maximal if S[i − 1] ≠ S[j − 1]
and right-maximal if S[i + |α|] ≠ S[j + |α|]. The pair is maximal if it is right- and
left-maximal. The gap of a pair (i, j, |α|) is the number of characters j − i − |α|
between the two occurrences of α in S.
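
Definition 1 can be checked by brute force on small strings. The Python sketch below is purely illustrative (cubic or worse in n, with names of our choosing); it models the two distinct boundary characters S[0] and S[n + 1] by fresh sentinel objects that compare unequal to everything else.

```python
def pairs(s: str):
    """Yield (i, j, length, left_maximal, right_maximal) for every pair, 1-based."""
    n = len(s)
    S = [object()] + list(s) + [object()]        # S[0] and S[n+1] are unique sentinels
    for length in range(1, n):
        for i in range(1, n - length + 1):
            for j in range(i + 1, n - length + 2):
                if S[i:i + length] == S[j:j + length]:
                    left = S[i - 1] != S[j - 1]
                    right = S[i + length] != S[j + length]
                    yield (i, j, length, left, right)


def maximal_pairs(s: str):
    """Return (i, j, length, gap) for every pair that is left- and right-maximal."""
    return [(i, j, length, j - i - length)
            for i, j, length, left, right in pairs(s) if left and right]


print(maximal_pairs("maximal"))                  # [(1, 5, 2, 2)]
```

The output reproduces the example from the introduction: the two occurrences of ma in maximal form a maximal pair with gap two.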

It follows from the definition that a string of length n in the worst case
contains Θ(n^2) right-maximal pairs. The string a^n contains the worst case number
of right-maximal pairs but only Θ(n) maximal pairs. The string (aab)^{n/3},
however, contains Θ(n^2) maximal pairs. This shows that the worst case number of
maximal pairs and right-maximal pairs in a string are asymptotically equal.

Figure 1 illustrates the occurrence of a pair. In some applications it might
be interesting only to find pairs that obey certain restrictions on the gap, e.g. to
filter out pairs of substrings that are overlapping or far apart and thus to reduce
the number of pairs to report. Using the "smaller-half trick", see Sect. 3.1 and
Lemma 6, it is easy to prove that a string of length n in the worst case contains
O(n log n) right-maximal pairs with gap in an interval of constant size.

In this paper we present methods for finding all right-maximal and maximal
pairs (i, j, |α|) in S with gap in a bounded interval. These methods all use the
suffix tree of S as the fundamental data structure. We briefly review the suffix
tree here; a more comprehensive treatment can be found in the literature.

Definition 2 (Suffix tree). The suffix tree T(S) of the string S is the
compressed trie of all suffixes of S. Each leaf in T(S) represents a suffix S[i .. n] of S
and is annotated with the index i. We refer to the set of indices stored at the
leaves in the subtree rooted at node v as the leaf-list of v and denote it LL(v).
Each edge in T(S) is labelled with a nonempty substring of S such that the path
from the root to the leaf annotated with index i spells the suffix S[i .. n]. We refer
to the substring of S spelled by the path from the root to node v as the path-label
of v and denote it L(v).

Fig. 1. An occurrence of a pair (i, j, |α|) with gap j − i − |α|.

The suffix tree T(S) can be constructed in time O(n). It follows
from the definition that all internal nodes in T(S) have out-degree between two
and |Σ|. We can turn the suffix tree T(S) into the binary suffix tree T_B(S) by
replacing every node v in T(S) with out-degree d > 2 by a binary tree with d − 1
internal nodes and d − 2 internal edges in which the d leaves are the d children
of node v. We label each new internal edge with the empty string such that
the d − 1 nodes replacing node v all have the same path-label as node v has
in T(S). Since T(S) has n leaves, constructing the binary suffix tree T_B(S)
requires adding at most n − 2 new nodes. Since each new node can be added in
constant time, the binary suffix tree T_B(S) can be constructed in time O(n).
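
The replacement of a high-degree node by a small binary tree can be sketched as follows. The representation is an assumption made for illustration only: a node is a tuple (path_label, children), leaves have an empty child list and carry their suffix index as label, and the d − 1 replacement nodes are arranged as a right-leaning chain.

```python
def binarize(node):
    """Replace every node of out-degree d > 2 by d - 1 binary nodes with its label."""
    label, children = node
    kids = [binarize(child) for child in children]
    if len(kids) <= 2:
        return (label, kids)
    sub = (label, kids[-2:])                 # innermost binary node: last two children
    for kid in reversed(kids[:-2]):          # each remaining child adds one new node
        sub = (label, [kid, sub])            # same path-label, i.e. an empty edge label
    return sub


def leaves(node):
    """Leaf labels below node, left to right (the leaf-list of node)."""
    label, children = node
    if not children:
        return [label]
    return [leaf for child in children for leaf in leaves(child)]


def count_internal(node):
    """Number of internal nodes in the subtree rooted at node."""
    label, children = node
    return (1 if children else 0) + sum(count_internal(child) for child in children)


v = ("ab", [(1, []), (4, []), (7, []), (9, [])])     # out-degree d = 4
b = binarize(v)
print(count_internal(b), leaves(b))                   # 3 [1, 4, 7, 9]
```

The leaf-list of the original node is preserved, and the d − 1 = 3 new internal nodes all carry the path-label of the node they replace.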

The binary suffix tree is an essential component of our methods. Definition 2
implies that there is a node v in T(S) with path-label α if and only if α is the
longest common prefix of S[i .. n] and S[j .. n] for some 1 ≤ i < j ≤ n. In other
words, there is a node v with path-label α if and only if (i, j, |α|) is a
right-maximal pair in S. Since S[i + |α|] ≠ S[j + |α|], the indices i and j cannot be
elements in the leaf-list of the same child of v. Using the binary suffix tree T_B(S)
we can thus formulate the following lemma.

Lemma 3. There is a right-maximal pair (i, j, |α|) in S if and only if there is a
node v in the binary suffix tree T_B(S) with path-label α and distinct children w_1
and w_2 where i ∈ LL(w_1) and j ∈ LL(w_2).

Lemma 3 gives an approach to find all right-maximal pairs in S: for every
internal node v in the binary suffix tree T_B(S) consider the leaf-lists at its two
children w_1 and w_2, and for every element (i, j) in LL(w_1) × LL(w_2) report a
right-maximal pair (i, j, |α|) if i < j and (j, i, |α|) if j < i. To find all maximal
pairs in S the problem remains to filter out all right-maximal pairs that are not
left-maximal.
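
To see the approach of Lemma 3 at work without a linear-time suffix tree construction, one can use an uncompressed trie of all suffixes, which takes quadratic space and is therefore only a teaching device. In the Python sketch below, which is our own simplification, every branching node of the trie plays the role of a suffix tree node v, the end of suffix i is marked by a child keyed ("$", i), and pairs are reported across distinct children.

```python
def right_maximal_pairs(s: str):
    """Report all right-maximal pairs (i, j, |alpha|), 1-based, via a suffix trie."""
    root = {}
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            node = node.setdefault(ch, {})
        node[("$", i + 1)] = None            # end marker: suffix i+1 ends at this node

    out = []

    def collect(node, depth):
        """Return the leaf-list of node and report pairs across distinct children."""
        lists = []
        for key, child in node.items():
            if child is None:                # end marker acts as a leaf child
                lists.append([key[1]])
            else:
                lists.append(collect(child, depth + 1))
        if depth > 0:                        # the root has the empty path-label
            for a in range(len(lists)):
                for b in range(a + 1, len(lists)):
                    for i in lists[a]:
                        for j in lists[b]:
                            out.append((min(i, j), max(i, j), depth))
        return [i for lst in lists for i in lst]

    collect(root, 0)
    return sorted(out)


print(right_maximal_pairs("maximal"))        # [(1, 5, 2), (2, 6, 1)]
```

Filtering this output by the left-maximality condition S[i − 1] ≠ S[j − 1] leaves only (1, 5, 2), the maximal pair of ma.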

3 Pairs with Upper and Lower Bounded Gap 

We want to find all maximal pairs (i, j, |α|) in S with gap between g_1(|α|)
and g_2(|α|), i.e. g_1(|α|) ≤ j − i − |α| ≤ g_2(|α|), where g_1 and g_2 are functions
that can be computed in constant time. An obvious approach is to generate all
maximal pairs in S and only report those with gap between g_1(|α|) and g_2(|α|),
but as shown above there might be asymptotically fewer maximal pairs in S
with gap between g_1(|α|) and g_2(|α|) than maximal pairs in S in total. We
therefore want to find all maximal pairs (i, j, |α|) in S with gap between g_1(|α|)
and g_2(|α|) without generating and considering all maximal pairs in S. A step
towards finding all maximal pairs with gap between g_1(|α|) and g_2(|α|) is to find
all right-maximal pairs with gap between g_1(|α|) and g_2(|α|).

Gerth Stølting Brodal et al.

Fig. 2. If (p, q, |α|) (respectively (q, p, |α|)) is a pair with gap between g_1(|α|)
and g_2(|α|), then one occurrence of α is at position p and the other occurrence
is at a position q in the interval R(p, |α|) (respectively L(p, |α|)) of positions.

Figure 2 shows that if one occurrence of α in a pair with gap between g_1(|α|)
and g_2(|α|) is at position p, then the other occurrence of α must be at a position q
in one of the two intervals L(p, |α|) = [p − |α| − g_2(|α|) .. p − |α| − g_1(|α|)] or
R(p, |α|) = [p + |α| + g_1(|α|) .. p + |α| + g_2(|α|)]. Together with Lemma 3 this
gives an approach to find all right-maximal pairs in S with gap between g_1(|α|)
and g_2(|α|): from every internal node v in the binary suffix tree T_B(S) with
path-label α and children w_1 and w_2, we report for every p in LL(w_1) the pairs
(p, q, |α|) for all q in LL(w_2) ∩ R(p, |α|) and the pairs (q, p, |α|) for all q in
LL(w_2) ∩ L(p, |α|).
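
The reporting step at a single node can be sketched in Python as below. A sorted list searched with `bisect` stands in for the search tree holding LL(w_2); this substitution, like the function and parameter names, is an assumption of the sketch rather than the data structure used in the paper.

```python
from bisect import bisect_left, bisect_right


def report_bounded(ll_w1, ll_w2, alpha_len, g1, g2):
    """Report pairs with gap in [g1(|alpha|), g2(|alpha|)] across two leaf-lists.

    ll_w2 must be sorted; g1 and g2 are constant-time functions of |alpha|.
    """
    lo_gap, hi_gap = g1(alpha_len), g2(alpha_len)
    out = []
    for p in ll_w1:
        # R(p, |alpha|) = [p + |alpha| + g1 .. p + |alpha| + g2]
        r_lo, r_hi = p + alpha_len + lo_gap, p + alpha_len + hi_gap
        for q in ll_w2[bisect_left(ll_w2, r_lo):bisect_right(ll_w2, r_hi)]:
            out.append((p, q, alpha_len))
        # L(p, |alpha|) = [p - |alpha| - g2 .. p - |alpha| - g1]
        l_lo, l_hi = p - alpha_len - hi_gap, p - alpha_len - lo_gap
        for q in ll_w2[bisect_left(ll_w2, l_lo):bisect_right(ll_w2, l_hi)]:
            out.append((q, p, alpha_len))
    return out


found = report_bounded([5], [1, 8, 9, 20], 2, lambda k: 1, lambda k: 3)
print(found)                                  # [(5, 8, 2), (5, 9, 2), (1, 5, 2)]
```

Only the elements of LL(w_2) inside the two intervals are touched, apart from the two binary searches per interval, which is the property the search trees are needed for.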

To report right-maximal pairs efficiently using this procedure, we must be
able to find for every p in LL(w_1), without looking at all the elements in LL(w_2),
the proper elements q in LL(w_2) to report it against. It turns out that search
trees make this possible. In this paper we use AVL trees, but other types of
search trees, e.g. (a, b)-trees or red-black trees, can also be used as long
as they obey Lemmas 4 and 5 stated below. Before we can formulate algorithms
we review some useful facts about AVL trees.



3.1 Data Structures 

An AVL tree T is a balanced search tree that stores an ordered set of elements. 
AVL trees were introduced in PJ , but are explained in almost every textbook on 
data structures. We say that an element e is in T, or $e \in T$, if it is stored at a
node in T. For brevity we use e to denote both the element and the node
at which it is stored in T. We can keep links between the nodes in T in such a
way that from the node e we can find, in constant time, the nodes next(e) and
prev(e) storing the next and previous element in increasing order. We use |T| to
denote the size of T, i.e. the number of elements stored in T.

Efficient merging of two AVL trees is essential to our methods. Hwang and
Lin show how to merge two sorted lists using the optimal number of
comparisons. Brown and Tarjan show how to implement merging of two
height-balanced search trees, e.g. AVL trees, in time proportional to the optimal
number of comparisons. Their result is summarized in Lemma 4, which immediately
implies Lemma 5.

Lemma 4. Two AVL trees of size at most n and m can be merged in time
$O\left(\log \binom{n+m}{n}\right)$.

Lemma 5. Given a sorted list of elements $e_1, e_2, \ldots, e_n$ and an AVL tree T
of size at most m, $m \ge n$, we can find $q_i = \min\{x \in T \mid x \ge e_i\}$ for all
$i = 1, 2, \ldots, n$ in time $O\left(\log \binom{n+m}{n}\right)$.

Proof. Construct the AVL tree of the elements $e_1, e_2, \ldots, e_n$ in time O(n). Merge
this AVL tree with T according to Lemma 4, except that whenever the merge
algorithm would insert one of the elements $e_1, e_2, \ldots, e_n$ into T, we change the
merge algorithm to report the neighbor of the element in T instead. This
modification does not increase the running time. □
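The query answered by Lemma 5 can be sketched in Python as follows. This is only an illustration of what is computed: a sorted Python list and binary search from the standard `bisect` module stand in for the AVL tree, so the sketch runs in time O(n log m) and does not achieve the bound of the lemma. The function name `successors` is an assumption of this sketch.

```python
from bisect import bisect_left


def successors(queries, tree_elems):
    """For each e in `queries`, return min{x in tree_elems | x >= e}, or None.

    `tree_elems` must be a sorted list; it plays the role of the AVL tree T.
    """
    result = []
    for e in queries:
        i = bisect_left(tree_elems, e)  # first index with tree_elems[i] >= e
        result.append(tree_elems[i] if i < len(tree_elems) else None)
    return result
```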

The “smaller-half trick” is essential to several methods for finding tandem
repeats. It says that the sum over all nodes v in an arbitrary binary tree
of size n of terms that are $O(n_1)$, where $n_1 \le n_2$ are the numbers of leaves in
the subtrees rooted at the two children of v, is $O(n \log n)$. Our methods rely on
a stronger version of the “smaller-half trick” hinted at in Ex. 35] and used 
in El Chap. 5, p. 84]; we summarize it in the following lemma. 

Lemma 6. Let T be an arbitrary binary tree with n leaves. The sum over all
internal nodes v in T of terms that are $O\left(\log \binom{n_1+n_2}{n_1}\right)$, where $n_1$ and $n_2$ are the
numbers of leaves in the subtrees rooted at the two children of v, is $O(n \log n)$.

Proof. As the terms are $O\left(\log \binom{n_1+n_2}{n_1}\right)$, we can find constants a and b such that
the terms are upper bounded by $a + b \log \binom{n_1+n_2}{n_1}$. We prove by induction on the
number of leaves of the binary tree that the sum is upper bounded by
$(2n - 1)a + b \log n!$. As $\log n! = O(n \log n)$, the lemma follows.

If T is a leaf then the upper bound holds vacuously. Now assume inductively
that the upper bound holds for all trees with at most n - 1 leaves. Let T be
a tree with n leaves where the numbers of leaves in the subtrees rooted at the
two children of the root are $n_1 < n$ and $n_2 < n$. According to the induction
hypothesis the sum over all nodes in these two subtrees, i.e. the sum over all nodes
of T except the root, is bounded by $(2n_1 - 1)a + b \log n_1! + (2n_2 - 1)a + b \log n_2!$,
and thus the entire sum is bounded by

$$
\begin{aligned}
&(2n_1 - 1)a + b \log n_1! + (2n_2 - 1)a + b \log n_2! + a + b \log \binom{n_1+n_2}{n_1} \\
&\quad = (2(n_1 + n_2) - 1)a + b \log n_1! + b \log n_2! + b \log (n_1 + n_2)! - b \log n_1! - b \log n_2! \\
&\quad = (2n - 1)a + b \log n!
\end{aligned}
$$

which proves the lemma. □
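The telescoping in the proof can be checked numerically. With a = 0 and b = 1 the induction holds with equality, so for any binary tree shape with n leaves the sum of $\log \binom{n_1+n_2}{n_1}$ over the internal nodes equals $\log n!$. The following Python sketch draws a random tree shape and evaluates this sum; the function name and the way the random shape is drawn are assumptions of this illustration.

```python
import math
import random


def sum_log_binom(n, rng):
    """Sum of log2 C(n1 + n2, n1) over the internal nodes of a random
    binary tree shape with n leaves (a leaf contributes nothing)."""
    if n == 1:
        return 0.0
    n1 = rng.randint(1, n - 1)  # leaves below the first child
    n2 = n - n1                 # leaves below the second child
    return (math.log2(math.comb(n, n1))
            + sum_log_binom(n1, rng)
            + sum_log_binom(n2, rng))
```

Comparing the result with `math.log2(math.factorial(n))` for a few values of n gives agreement up to floating-point rounding, independently of the shape drawn.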

3.2 Algorithms 

We first describe an algorithm that finds all right-maximal pairs in S with
bounded gap using AVL trees to keep track of the elements in the leaf-lists
during a traversal of the binary suffix tree $T_B(S)$. We then extend it to find all
maximal pairs in S with bounded gap using an additional AVL tree to filter out
efficiently all right-maximal pairs that are not left-maximal. Both algorithms
run in time $O(n \log n + z)$ and space $O(n)$, where z is the number of reported
pairs. In the following we assume, unless stated otherwise, that v is a node in the
binary suffix tree $T_B(S)$ with path-label $\alpha$ and children $w_1$ and $w_2$ named such
that $|LL(w_1)| \le |LL(w_2)|$. We say that $w_1$ is the small child of v and that $w_2$ is
the big child of v.

Right-Maximal Pairs with Upper and Lower Bounded Gap. To find all
right-maximal pairs in S with gap between $g_1(|\alpha|)$ and $g_2(|\alpha|)$ we consider every
node v in the binary suffix tree $T_B(S)$ in a bottom-up fashion, e.g. during a
depth-first traversal. At every node v we use AVL trees storing the leaf-lists $LL(w_1)$
and $LL(w_2)$ at its two children to report the proper right-maximal pairs of its
path-label $\alpha$. The details are given in Algorithm 1 and explained below.

At every node v in $T_B(S)$ we construct an AVL tree, the leaf-list tree T,
that stores the elements in $LL(v)$. If v is a leaf then we construct T directly
in Step 1. If v is an internal node then $LL(v)$ is the union of the disjoint
leaf-lists $LL(w_1)$ and $LL(w_2)$, which by assumption are stored in the already
constructed $T_1$ and $T_2$, so we construct T by merging $T_1$ and $T_2$, $|T_1| \le |T_2|$, using
Lemma 4. Before constructing T in Step 2c we use $T_1$ and $T_2$ to report
right-maximal pairs from node v by reporting every p in $LL(w_1)$ against every q in
$LL(w_2) \cap L(p, |\alpha|)$ and $LL(w_2) \cap R(p, |\alpha|)$. This is done in two steps. In Step 2a
we find for every p in $LL(w_1)$ the minimum element $q_r(p)$ in $LL(w_2) \cap R(p, |\alpha|)$
and the minimum element $q_l(p)$ in $LL(w_2) \cap L(p, |\alpha|)$ by searching in $T_2$ using
Lemma 5. In Step 2b we report pairs $(p, q, |\alpha|)$ and $(q, p, |\alpha|)$ for every p in
$LL(w_1)$ and increasing q's in $LL(w_2)$ starting with $q_r(p)$ and $q_l(p)$ respectively,
until the gap violates the upper or lower bound.

To argue that Algorithm 1 finds all right-maximal pairs with gap between
$g_1(|\alpha|)$ and $g_2(|\alpha|)$ it is enough to argue that for every p in $LL(w_1)$ we
report all right-maximal pairs $(p, q, |\alpha|)$ and $(q, p, |\alpha|)$ with gap between $g_1(|\alpha|)$
and $g_2(|\alpha|)$. The rest follows because at every node v in $T_B(S)$ we consider
every p in $LL(w_1)$. Consider the call Report($q_r(p)$, $p + |\alpha| + g_2(|\alpha|)$) in Step 2b.







Algorithm 1 Find all right-maximal pairs in string S with bounded gap.

1. Initializing: Build the binary suffix tree $T_B(S)$ and create at each leaf an AVL tree
   of size one that stores the index at the leaf.

2. Reporting and merging: When the AVL trees $T_1$ and $T_2$, $|T_1| \le |T_2|$, at the two
   children $w_1$ and $w_2$ of node v with path-label $\alpha$ are available, we do the following:

   (a) Let $\{p_1, p_2, \ldots, p_s\}$ be the elements in $T_1$ in sorted order. For each element p
       in $T_1$ we find

           $q_r(p) = \min\{x \in T_2 \mid x \ge p + |\alpha| + g_1(|\alpha|)\}$
           $q_l(p) = \min\{x \in T_2 \mid x \ge p - |\alpha| - g_2(|\alpha|)\}$

       by searching in $T_2$ with the sorted lists $\{p_i + |\alpha| + g_1(|\alpha|) \mid i = 1, 2, \ldots, s\}$ and
       $\{p_i - |\alpha| - g_2(|\alpha|) \mid i = 1, 2, \ldots, s\}$ using Lemma 5.

   (b) For each element p in $T_1$ we do Report($q_r(p)$, $p + |\alpha| + g_2(|\alpha|)$) and
       Report($q_l(p)$, $p - |\alpha| - g_1(|\alpha|)$) where Report is the following procedure.

           def Report(from, to):
               q = from
               while $q \le$ to:
                   report pair $(p, q, |\alpha|)$ if $p < q$, and $(q, p, |\alpha|)$ otherwise
                   q = next(q)

   (c) Build the leaf-list tree T at node v by merging $T_1$ and $T_2$ using Lemma 4.
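Steps 2a and 2b at a single node can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation analysed in the text: sorted Python lists stand in for the AVL trees $T_1$ and $T_2$, binary search via `bisect_left` replaces the simultaneous search of Lemma 5, and $g_1(|\alpha|) \ge 0$ is assumed so that p never coincides with a reported q. The function name `report_right_maximal` is hypothetical.

```python
from bisect import bisect_left


def report_right_maximal(ll_w1, ll_w2, alpha_len, g1, g2):
    """Report all pairs (i, j, alpha_len), i < j, formed by one position from
    ll_w1 and one from ll_w2 (both sorted) with gap in [g1, g2]."""
    lo_gap, hi_gap = g1(alpha_len), g2(alpha_len)
    pairs = []

    def report(p, start, to):
        k = bisect_left(ll_w2, start)  # index of q_r(p) or q_l(p)
        while k < len(ll_w2) and ll_w2[k] <= to:
            q = ll_w2[k]
            pairs.append((p, q, alpha_len) if p < q else (q, p, alpha_len))
            k += 1                     # plays the role of next(q)

    for p in ll_w1:
        report(p, p + alpha_len + lo_gap, p + alpha_len + hi_gap)  # R(p, |alpha|)
        report(p, p - alpha_len - hi_gap, p - alpha_len - lo_gap)  # L(p, |alpha|)
    return pairs
```

The loop over `ll_w1` mirrors the fact that every element of the small child is considered, while elements of the big child are touched only when a pair is reported.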



From the implementation of Report it follows that this call reports p against
every q in $LL(w_2) \cap [q_r(p) \,..\, p + |\alpha| + g_2(|\alpha|)]$. By construction of $q_r(p)$ and
definition of $R(p, |\alpha|)$ it follows that $LL(w_2) \cap [q_r(p) \,..\, p + |\alpha| + g_2(|\alpha|)]$ is equal to
$LL(w_2) \cap R(p, |\alpha|)$, so the call reports all pairs $(p, q, |\alpha|)$ with gap between $g_1(|\alpha|)$
and $g_2(|\alpha|)$. Similarly we can argue that the call Report($q_l(p)$, $p - |\alpha| - g_1(|\alpha|)$)
reports all pairs $(q, p, |\alpha|)$ with gap between $g_1(|\alpha|)$ and $g_2(|\alpha|)$.

Now consider the running time of Algorithm 1. Building the binary suffix
tree $T_B(S)$ and creating an AVL tree of size one at each leaf in Step 1 takes
time $O(n)$. At every internal node in $T_B(S)$ we do Step 2. Since $|T_1| \le |T_2|$,
searching in Step 2a and merging in Step 2c take time $O\left(\log \binom{|T_1|+|T_2|}{|T_1|}\right)$ by
Lemmas 5 and 4 respectively. Reporting of pairs in Step 2b takes time
proportional to $|T_1|$, because we consider every p in $LL(w_1)$, plus the number of
reported pairs. Summing this over all nodes gives by Lemma 6 that the total
running time is $O(n \log n + z)$, where z is the number of reported pairs. Since
constructing and keeping $T_B(S)$ requires space $O(n)$, and since no element at
any time is in more than one leaf-list tree, Algorithm 1 requires space $O(n)$.



Theorem 7. Algorithm 1 finds all right-maximal pairs $(i, j, |\alpha|)$ in a string S
with gap between $g_1(|\alpha|)$ and $g_2(|\alpha|)$ in space $O(n)$ and time $O(n \log n + z)$,
where z is the number of reported pairs and n is the length of S.

Maximal Pairs with Upper and Lower Bounded Gap. We now turn
towards finding all maximal pairs in S with gap between $g_1(|\alpha|)$ and $g_2(|\alpha|)$. Our
approach is to extend Algorithm 1 to filter out all right-maximal pairs that are
not left-maximal. A simple solution is to extend the procedure Report to check
if $S[p-1] \ne S[q-1]$ before reporting the pair $(p, q, |\alpha|)$ or $(q, p, |\alpha|)$ in Step 2b.
This solution takes time proportional to the number of inspected right-maximal
pairs, and not time proportional to the number of reported maximal pairs. Even
though the maximum number of right-maximal pairs and maximal pairs in strings
of a given length are asymptotically equal, many strings contain significantly
fewer maximal pairs than right-maximal pairs. We therefore want to filter out all
right-maximal pairs that are not left-maximal without inspecting all right-maximal
pairs. In the remainder of this section we describe one way to do this.

Consider the reporting step in Algorithm 1 and assume that we are about to
report from a node v with children $w_1$ and $w_2$. The leaf-list trees $T_1$ and $T_2$,
$|T_1| \le |T_2|$, are available and they make it possible to access the elements
in $LL(w_1) = \{p_1, p_2, \ldots, p_s\}$ and $LL(w_2) = \{q_1, q_2, \ldots, q_t\}$ in sorted order. We
divide the sorted leaf-list $LL(w_2)$ into blocks of contiguous elements such that
the elements $q_{i-1}$ and $q_i$ are in the same block if and only if $S[q_{i-1} - 1] = S[q_i - 1]$.
We say that we divide the sorted leaf-list into blocks of elements with equal
left-characters. To filter out all right-maximal pairs that are not left-maximal we
must avoid reporting p in $LL(w_1)$ against any element q in $LL(w_2)$ in a block of
elements with left-character $S[p-1]$. This gives the overall idea of the extended
algorithm: we extend the reporting step in Algorithm 1 such that whenever we
are about to report p in $LL(w_1)$ against q in $LL(w_2)$ where $S[p-1] = S[q-1]$,
we skip all elements in the current block containing q and continue reporting p
against the first element q' in the following block, which by the definition of
blocks satisfies $S[p-1] \ne S[q'-1]$.

To implement this extended reporting step efficiently we must be able to
skip all elements in a block without inspecting each of them. We achieve this
by constructing an additional AVL tree, the block-start tree, that keeps track of
the blocks in the leaf-list. At each node v during the traversal of $T_B(S)$ we thus
construct two AVL trees: the leaf-list tree T that stores the elements in $LL(v)$,
and the block-start tree B that keeps track of the blocks in the sorted leaf-list
by storing all the elements in $LL(v)$ that start a block. We keep links from the
block-start tree to the leaf-list tree such that we can go in constant time from an
element in the block-start tree to the corresponding element in the leaf-list tree.
Figure 3 illustrates the leaf-list tree, the block-start tree and the links between
them. Before we present the extended algorithm and explain how to use the
block-start tree to efficiently skip all elements in a block, we first describe how
to construct the leaf-list tree T and block-start tree B at node v from the leaf-list
trees, $T_1$ and $T_2$, and block-start trees, $B_1$ and $B_2$, at its two children $w_1$ and $w_2$.

Since $LL(v)$ is the union of the disjoint leaf-lists $LL(w_1)$ and $LL(w_2)$ stored
in $T_1$ and $T_2$ respectively, we can construct the leaf-list tree T by merging $T_1$
and $T_2$ using Lemma 4. It is more involved to construct the block-start tree B.

Fig. 3. The data structure constructed at each node v in $T_B(S)$. The leaf-list
tree T stores all elements in $LL(v)$. The block-start tree B stores all elements
in $LL(v)$ that start a block in the sorted leaf-list. We keep links from the elements
in the block-start tree to the corresponding elements in the leaf-list tree.



The reason is that an element $p_i$ that starts a block in $LL(w_1)$ or an element $q_j$
that starts a block in $LL(w_2)$ does not necessarily start a block in $LL(v)$ and vice
versa, so we cannot construct B by merging $B_1$ and $B_2$. Let $\{e_1, e_2, \ldots, e_{s+t}\}$
be the elements in $LL(v)$ in sorted order. By definition the block-start tree B
contains all elements $e_k$ in $LL(v)$ where $S[e_{k-1} - 1] \ne S[e_k - 1]$. We construct B
by modifying $B_2$. We choose to modify $B_2$, and not $B_1$, because $|LL(w_1)| \le
|LL(w_2)|$, which by the “smaller-half trick” allows us to consider all elements
in $LL(w_1)$ without spending too much time in total. To modify $B_2$ to become B
we must identify all the elements that are in B but not in $B_2$ and vice versa.

Lemma 8. If $e_k$ is in B but not in $B_2$ then $e_k \in LL(w_1)$ or $e_{k-1} \in LL(w_1)$.

Proof. Assume that $e_k$ is in B and that $e_k$ and $e_{k-1}$ both are in $LL(w_2)$.
In $LL(w_2)$ the elements $e_k$ and $e_{k-1}$ are neighboring elements $q_j$ and $q_{j-1}$.
Since $e_k$ starts a block in $LL(v)$, we have $S[q_j - 1] = S[e_k - 1] \ne S[e_{k-1} - 1] =
S[q_{j-1} - 1]$. This shows that $q_j = e_k$ is in $B_2$ and the lemma follows. □

Let NEW be the set of elements $e_k$ in B where $e_k$ or $e_{k-1}$ is in $LL(w_1)$. It
follows from Lemma 8 that this set contains at least all elements in B that are
not in $B_2$. It is easy to see that we can construct NEW in sorted order while
merging $T_1$ and $T_2$: whenever an element $e_k$ from $T_1$, i.e. $LL(w_1)$, is placed in T,
i.e. $LL(v)$, we include it, and/or the next element $e_{k+1}$ placed in T, in NEW if
they start a block in $LL(v)$.

If we insert the elements in NEW we are halfway done modifying B2 to 
become B. We still need to identify and remove the elements that should be 
removed from B2, that is, the elements that are in B2 but not in B. 

Lemma 9. An element $q_j$ in $B_2$ is not in B if and only if the largest element $e_k$
in NEW smaller than $q_j$ in $B_2$ has the same left-character as $q_j$.

Proof. If $q_j$ is in $B_2$ but does not start a block in $LL(v)$, then it must be in a
block started by some element $e_k$ with the same left-character as $q_j$. This block
cannot contain $q_{j-1}$ because $q_j$ being in $B_2$ implies that $S[q_j - 1] \ne S[q_{j-1} - 1]$.
We thus have the ordering $q_{j-1} < e_k < q_j$. This implies that $e_k$ is the largest
element in NEW smaller than $q_j$. Conversely, if $e_k$ is the largest element in NEW
smaller than $q_j$, then no block starts in $LL(v)$ between $e_k$ and $q_j$, i.e. all elements e in
$LL(v)$ where $e_k < e < q_j$ satisfy $S[e - 1] = S[e_k - 1]$, so if $S[e_k - 1] = S[q_j - 1]$
then $q_j$ does not start a block in $LL(v)$. □

By searching in $B_2$ with the sorted list NEW using Lemma 5 it is
straightforward to find all pairs of elements $(e_k, q_j)$, where $e_k$ is the largest element in
NEW smaller than $q_j$ in $B_2$. If the left-characters of $e_k$ and $q_j$ in such a pair
are equal, i.e. $S[e_k - 1] = S[q_j - 1]$, then by Lemma 9 the element $q_j$ is not in B
and must therefore be removed from $B_2$. It follows from the proof of Lemma 9
that if this is the case then $q_{j-1} < e_k < q_j$, so we can, without destroying the
order among the nodes in $B_2$, remove $q_j$ from $B_2$ and insert $e_k$ instead, simply
by replacing the element $q_j$ with the element $e_k$ at the node storing $q_j$ in $B_2$.

We can now summarize the three steps it takes to modify $B_2$ to become B.
In Step 1 we construct the sorted set NEW that contains all elements in B
that are not in $B_2$. This is done while merging $T_1$ and $T_2$ using Lemma 4. In
Step 2 we remove the elements from $B_2$ that are not in B. The elements in $B_2$
being removed and the elements from NEW replacing them are identified using
Lemmas 5 and 9. In Step 3 we merge the remaining elements in NEW into the
modified $B_2$ using Lemma 4. Adding links from the new elements in B to the
corresponding elements in T can be done while replacing and merging in Steps 2
and 3. Since $|NEW| \le 2|T_1|$ and $|B_2| \le |T_2|$, the time it takes to construct B
is dominated by the time it takes to merge a sorted list of size $2|T_1|$ into an
AVL tree of size $|T_2|$. By Lemma 4 this is within a constant factor of the time it
takes to merge $T_1$ and $T_2$, so the time it takes to construct B is dominated by
the time it takes to construct the leaf-list tree T.
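The definition of block starts can be made concrete with a short Python sketch. It recomputes the block starts of a sorted leaf-list directly from the definition, which is only meant to show what B contains; it is not the incremental three-step modification of $B_2$ described above. Positions are assumed to be 0-based, a position 0 is assumed to have no left-character (represented by `None`), and the names `left_char` and `block_starts` are hypothetical.

```python
def left_char(S, p):
    """Left-character of position p (0-based); None if p is the first position."""
    return S[p - 1] if p > 0 else None


def block_starts(S, leaf_list):
    """Elements of the sorted leaf_list that start a block of equal left-characters."""
    return [e for k, e in enumerate(leaf_list)
            if k == 0 or left_char(S, e) != left_char(S, leaf_list[k - 1])]
```

For S = "aXaXbXaX" with $LL(w_1)$ = [5] and $LL(w_2)$ = [1, 3, 7], the block starts of the children are [5] and [1], while the block starts of the merged list [1, 3, 5, 7] are [1, 5, 7]. The element 7 starts a block in $LL(v)$ without being in $B_1$ or $B_2$, and its predecessor 5 is in $LL(w_1)$, as Lemma 8 requires.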

Now that we know how to construct the leaf-list tree T and block-start tree B
at node v from the leaf-list trees, $T_1$ and $T_2$, and block-start trees, $B_1$ and $B_2$,
at its two children $w_1$ and $w_2$, we can proceed with the implementation of the
extended reporting step. The details are shown in Algorithm 2. This algorithm
is similar to Algorithm 1 except that at every node v in $T_B(S)$ we construct
two AVL trees: the leaf-list tree T that stores the elements in $LL(v)$, and the
block-start tree B that keeps track of the blocks in $LL(v)$ by storing the subset
of elements that start a block. If v is a leaf, we construct T and B directly. If v
is an internal node, we construct T by merging the leaf-list trees $T_1$ and $T_2$ at
its two children $w_1$ and $w_2$, and we construct B by modifying the block-start
tree $B_2$ as explained above.

Before constructing T and B we report all maximal pairs from node v with
gap between $g_1(|\alpha|)$ and $g_2(|\alpha|)$ by reporting every p in $LL(w_1)$ against every q in
$LL(w_2) \cap L(p, |\alpha|)$ and $LL(w_2) \cap R(p, |\alpha|)$ where $S[p-1] \ne S[q-1]$. This is done in
two steps. In Step 2a we find for every p in $LL(w_1)$ the minimum elements $q_l(p)$
and $q_r(p)$, as well as the minimum elements $b_l(p)$ and $b_r(p)$ that start a block, in
$LL(w_2) \cap L(p, |\alpha|)$ and $LL(w_2) \cap R(p, |\alpha|)$ respectively. This is done by searching
in $T_2$ and $B_2$ using Lemma 5. In Step 2b we report pairs $(p, q, |\alpha|)$ and $(q, p, |\alpha|)$

Algorithm 2 Find all maximal pairs in string S with bounded gap.

1. Initializing: Build the binary suffix tree $T_B(S)$ and create at each leaf two AVL
   trees of size one, the leaf-list tree and the block-start tree, both storing the index at
   the leaf.

2. Reporting and merging: When the leaf-list trees $T_1$ and $T_2$, $|T_1| \le |T_2|$, and the
   block-start trees $B_1$ and $B_2$ at the two children $w_1$ and $w_2$ of node v with path-label
   $\alpha$ are available, we do the following:

   (a) Let $\{p_1, p_2, \ldots, p_s\}$ be the elements in $T_1$ in sorted order. For each element p
       in $T_1$ we find

           $q_r(p) = \min\{x \in T_2 \mid x \ge p + |\alpha| + g_1(|\alpha|)\}$
           $q_l(p) = \min\{x \in T_2 \mid x \ge p - |\alpha| - g_2(|\alpha|)\}$
           $b_r(p) = \min\{x \in B_2 \mid x \ge p + |\alpha| + g_1(|\alpha|)\}$
           $b_l(p) = \min\{x \in B_2 \mid x \ge p - |\alpha| - g_2(|\alpha|)\}$

       by searching in $T_2$ and $B_2$ with the sorted lists $\{p_i + |\alpha| + g_1(|\alpha|) \mid i =
       1, 2, \ldots, s\}$ and $\{p_i - |\alpha| - g_2(|\alpha|) \mid i = 1, 2, \ldots, s\}$ using Lemma 5.

   (b) For each element p in $T_1$ we do ReportMax($q_r(p)$, $b_r(p)$, $p + |\alpha| + g_2(|\alpha|)$) and
       ReportMax($q_l(p)$, $b_l(p)$, $p - |\alpha| - g_1(|\alpha|)$) where ReportMax is the following
       procedure.

           def ReportMax(from_T, from_B, to):
               q = from_T
               b = from_B
               while $q \le$ to:
                   if $S[q-1] \ne S[p-1]$:
                       report pair $(p, q, |\alpha|)$ if $p < q$, and $(q, p, |\alpha|)$ otherwise
                       q = next(q)
                   else:
                       while $b \le q$:
                           b = next(b)
                       q = b

   (c) Build the leaf-list tree T at node v by merging $T_1$ and $T_2$ using Lemma 4.
       Build the block-start tree B at node v by modifying $B_2$ as described in the
       text.
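The block-skipping loop of ReportMax can be sketched in Python as follows. As before this is an illustration under assumptions rather than the analysed implementation: a sorted Python list replaces the leaf-list tree $T_2$, the block-start tree $B_2$ is replaced by a sorted list of indices into that list (so that the link from a block start to its leaf-list element is an index lookup), positions are 0-based, and all function names are hypothetical.

```python
from bisect import bisect_left


def left_char(S, p):
    """Left-character of position p (0-based); None if p is the first position."""
    return S[p - 1] if p > 0 else None


def block_start_indices(S, ll):
    """Indices k into the sorted list ll at which a block of equal left-characters starts."""
    return [k for k in range(len(ll))
            if k == 0 or left_char(S, ll[k]) != left_char(S, ll[k - 1])]


def report_max(S, p, ll_w2, starts, start, to, alpha_len):
    """Report p against every q in ll_w2 with start <= q <= to and a different
    left-character, skipping whole blocks whose left-character equals that of p."""
    out = []
    k = bisect_left(ll_w2, start)  # index of q(p)
    b = bisect_left(starts, k)     # first block start at or after q(p), i.e. b(p)
    while k < len(ll_w2) and ll_w2[k] <= to:
        q = ll_w2[k]
        if left_char(S, q) != left_char(S, p):
            out.append((p, q, alpha_len) if p < q else (q, p, alpha_len))
            k += 1
        else:
            while b < len(starts) and starts[b] <= k:
                b += 1             # advance to the block following the one containing q
            if b == len(starts):
                break              # no further block, nothing left to report
            k = starts[b]
    return out
```

On the toy input S = "aXaXaXbXaX", p = 1 and leaf-list [3, 5, 7, 9], the elements 3 and 5 form a block with the same left-character as p and are skipped in one step, 7 is reported, and the final block containing 9 is skipped again.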



for every p in $LL(w_1)$ and increasing q's in $LL(w_2)$ starting with $q_r(p)$ and $q_l(p)$
respectively, until the gap violates the upper or lower bound. Whenever we are
about to report p against q where $S[p-1] = S[q-1]$, we instead use the
block-start tree $B_2$ to skip all elements in the block containing q and continue with
reporting p against the first element in the following block.

To argue that Algorithm 2 finds all maximal pairs with gap between $g_1(|\alpha|)$
and $g_2(|\alpha|)$ it is enough to argue that for every p in $LL(w_1)$ we report all
maximal pairs $(p, q, |\alpha|)$ and $(q, p, |\alpha|)$ with gap between $g_1(|\alpha|)$ and $g_2(|\alpha|)$. The rest
follows because at every node in $T_B(S)$ we consider every p in $LL(w_1)$. Consider
the call ReportMax($q_r(p)$, $b_r(p)$, $p + |\alpha| + g_2(|\alpha|)$) in Step 2b. From the
implementation of ReportMax it follows that unless we skip elements by increasing b,
we consider every q in $LL(w_2) \cap R(p, |\alpha|)$. The test $S[q-1] \ne S[p-1]$
before reporting a pair ensures that we only report maximal pairs, and
whenever $S[q-1] = S[p-1]$ we increase b until $b = \min\{x \in B_2 \mid x > q\}$. This
is, by construction of $B_2$ and $b_r(p)$, the element that starts the block
following the block containing q, so all elements q', $q < q' < b$, that we skip by
setting q to b satisfy $S[p-1] = S[q-1] = S[q'-1]$. We thus conclude that
ReportMax($q_r(p)$, $b_r(p)$, $p + |\alpha| + g_2(|\alpha|)$) reports p against exactly those q in
$LL(w_2) \cap R(p, |\alpha|)$ where $S[p-1] \ne S[q-1]$, i.e. it reports all maximal pairs
$(p, q, |\alpha|)$ at node v with gap between $g_1(|\alpha|)$ and $g_2(|\alpha|)$. Similarly, the call
ReportMax($q_l(p)$, $b_l(p)$, $p - |\alpha| - g_1(|\alpha|)$) reports all maximal pairs $(q, p, |\alpha|)$ with
gap between $g_1(|\alpha|)$ and $g_2(|\alpha|)$.

Now consider the running time of Algorithm 2. We first argue that the call
ReportMax($q_r(p)$, $b_r(p)$, $p + |\alpha| + g_2(|\alpha|)$) takes constant time plus time
proportional to the number of reported pairs $(p, q, |\alpha|)$. To do this all we have to show
is that the time used to skip blocks, i.e. the number of times we increase b, is
proportional to the number of reported pairs. By construction $b_r(p) \ge q_r(p)$,
so the number of times we increase b is bounded by the number of blocks in
$LL(w_2) \cap R(p, |\alpha|)$. Since neighboring blocks contain elements with different
left-characters, we report p against an element from at least every second block in
$LL(w_2) \cap R(p, |\alpha|)$. The number of times we increase b is thus proportional to
the number of reported pairs. The call ReportMax($q_l(p)$, $b_l(p)$, $p - |\alpha| - g_1(|\alpha|)$)
also takes constant time plus time proportional to the number of reported pairs
$(q, p, |\alpha|)$. We thus have that Step 2b takes time proportional to $|T_1|$ plus the
number of reported pairs. Everything else we do at node v, i.e. searching in $T_2$
and $B_2$ and constructing the leaf-list tree T and block-start tree B, takes time
$O\left(\log \binom{|T_1|+|T_2|}{|T_1|}\right)$. Summing this over all nodes gives by Lemma 6 that the total
running time of the algorithm is $O(n \log n + z)$ where z is the number of reported
pairs. Since constructing and keeping $T_B(S)$ requires space $O(n)$, and since no
element at any time is in more than one leaf-list tree, and at most one block-start
tree, Algorithm 2 requires space $O(n)$.

Theorem 10. Algorithm 2 finds all maximal pairs $(i, j, |\alpha|)$ in a string S with
gap between $g_1(|\alpha|)$ and $g_2(|\alpha|)$ in space $O(n)$ and time $O(n \log n + z)$, where z
is the number of reported pairs and n is the length of S.

We observe that Algorithm 2 never uses the block-start tree $B_1$ at the small
child $w_1$. This observation can be used to ensure that only one block-start tree
exists during the execution of the algorithm. If we implement the traversal of $T_B(S)$
as a depth-first traversal in which at each node v we first recursively traverse the
subtree rooted at the small child $w_1$, then we do not need to store the block-start
tree returned by this recursive traversal while recursively traversing the subtree
rooted at the big child $w_2$. This implies that only one block-start tree exists at
all times during the recursive traversal of $T_B(S)$. The drawback is that at
each node v we need to know in advance which child is the small child, but this
knowledge can be obtained in linear time by annotating each node with the size
of the subtree it roots.

4 Pairs with Lower Bounded Gap 

If we relax the constraint on the gap and only want to find all maximal pairs
in S with gap at least $g(|\alpha|)$, where g is a function that can be computed
in constant time, then a straightforward solution is to use Algorithm 2 with
$g_1(|\alpha|) = g(|\alpha|)$ and $g_2(|\alpha|) = n$. This obviously finds all maximal pairs with
gap at least $g_1(|\alpha|) = g(|\alpha|)$ in time $O(n \log n + z)$. However, the missing upper
bound on the gap, i.e. the trivial upper bound $g_2(|\alpha|) = n$, makes it possible to
reduce the running time to $O(n + z)$ since reporting from each node during the
traversal of the binary suffix tree is simplified.

The reporting of pairs from node v with children $w_1$ and $w_2$ is simplified,
because the lack of an upper bound on the gap implies that we do not have
to search $LL(w_2)$ for the first element to report against the current element
in $LL(w_1)$. Instead we can start by reporting the current element in $LL(w_1)$
against the biggest (and smallest) element in $LL(w_2)$ and then continue
reporting it against decreasing (and increasing) elements from $LL(w_2)$ until the gap
becomes smaller than $g(|\alpha|)$. Unfortunately this simplification alone does not
reduce the asymptotic running time, because inspecting every element in $LL(w_1)$
and keeping track of the leaf-lists in AVL trees alone requires time $O(n \log n)$. To
reduce the running time we must thus avoid inspecting every element in $LL(w_1)$
and find another way to store the leaf-lists.
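The simplified reporting order can be sketched in Python as follows. The sketch only illustrates scanning from the two ends of $LL(w_2)$; it still visits every element of $LL(w_1)$ and therefore does not by itself give the improved running time, exactly as noted above. Sorted Python lists are assumed in place of the leaf-list structures, and the function name is hypothetical.

```python
def report_lower_bounded(ll_w1, ll_w2, alpha_len, g):
    """Report all pairs with gap at least g(alpha_len), scanning the sorted list
    ll_w2 from its largest and from its smallest element."""
    min_gap = g(alpha_len)
    pairs = []
    for p in ll_w1:
        for q in reversed(ll_w2):             # decreasing q, occurrences to the right of p
            if q < p + alpha_len + min_gap:   # gap q - p - |alpha| has become too small
                break
            pairs.append((p, q, alpha_len))
        for q in ll_w2:                       # increasing q, occurrences to the left of p
            if q > p - alpha_len - min_gap:   # gap p - q - |alpha| has become too small
                break
            pairs.append((q, p, alpha_len))
    return pairs
```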

We achieve this by using a data structure based on heap-ordered trees to
store the leaf-lists during the traversal of the binary suffix tree. The key feature
of the data structure is that it allows us to merge two trees in amortized constant
time. The details of the data structure and the methods using it to find pairs
with gap at least $g(|\alpha|)$ are given in [2, Sect. 4]. Here we just summarize the result.

Theorem 11. All maximal pairs $(i, j, |\alpha|)$ in a string S with gap at least $g(|\alpha|)$
can be found in space $O(n)$ and time $O(n + z)$, where z is the number of reported
pairs and n is the length of S.

5 Conclusion 

We have presented efficient and flexible methods to find all maximal pairs
$(i, j, |\alpha|)$ in a string under various constraints on the gap $j - i - |\alpha|$. If the gap
is required to be between $g_1(|\alpha|)$ and $g_2(|\alpha|)$, the running time is $O(n \log n + z)$
where n is the length of the string and z is the number of reported pairs. If the
gap is only required to be at least $g_1(|\alpha|)$, the running time reduces to $O(n + z)$.
In both cases we use space $O(n)$.

In some cases it might be interesting only to find maximal pairs $(i, j, |\alpha|)$
fulfilling additional requirements on $|\alpha|$, e.g. to filter out pairs of short substrings.
This is straightforward to do using our methods by only reporting from the nodes
in the binary suffix tree whose path-label $\alpha$ fulfills the requirements on $|\alpha|$. In
other cases it might be of interest just to find the vocabulary of substrings that
occur in maximal pairs. This is also straightforward to do using our methods by
just reporting the path-label $\alpha$ of a node if we can report one or more maximal
pairs from the node.

Instead of just looking for maximal pairs, it could be interesting to look 
for an array of occurrences of the same substring in which the gap between 
consecutive occurrences is bounded by some constants. This problem requires a 
suitable definition of a maximal array. One definition and approach is presented 
in 12 ill . Another definition inspired by the definition of a maximal pair could 
be to require that every pair of occurrences in the array is a maximal pair. 
This definition seems very restrictive. A more relaxed definition could be to only 
require that we cannot extend all the occurrences in the array to the left or to 
the right without destroying at least one pair of occurrences in the array.

Abstract. This paper proposes a simple data structure, called a prefix
list, which maintains all prefixes of a string in reverse lexicographic order. 
It can be on-line incrementally constructed in time and space linear in 
the string length. It is strongly related to suffix trees and suffix arrays, 
and may share applications with these existing structures. A suffix array 
can be built via the corresponding prefix list in linear time. Particular 
applications of the prefix list lie in source-coding problems that require 
on-line right-to-left string matching. We apply the prefix list to on-line 
estimation of source entropy and to context-based symbol-ranking text 
compression algorithms. 



1 Introduction 

We propose a simple data structure, called the prefix list, which can store all 
prefixes of a string in reverse lexicographic order. A prefix list can be on-line 
incrementally constructed in time and space linear in the string length. We can 
apply it to string matching problems and to data compression algorithms. 

The proposed data structure is deeply related to such index structures as 
suffix trees g, um, m and suffix arrays g, |E|. The suffix array for a text is 
an array of integers which represent lexicographic orders of all suffixes of the 
text. It was proposed as a space-efficient alternative to the more ubiquitous 
suffix tree. Whether we use suffix trees or suffix arrays, we usually suppose a 
text to be static and fixed in the sense that we preprocess it to accept multiple 
queries afterwards. In particular, a suffix array must be constructed from scratch 
even if a bit of modification is added to the text. In some actual situations, we 
must answer index-based string-matching problems while incrementally reading 
a text. A suffix tree, which can be constructed in an on-line manner, serves as 
a strong tool in such situations. However, since strings are represented in one 
direction from the root to leaves on a suffix tree, it is difficult to match strings 
from right to left. We actually have some on-line string problems, in which we 
should match strings in that direction. 

* Partially supported by the Kayamori foundation of informational science
advancement and by the Okawa foundation for information and telecommunications.

Such situations sometimes arise in the context modeling stage in text
compression. In most symbolwise (predictive) text compression algorithms, an
upcoming symbol is predicted on the basis of its context. Such a “context-based”
method gathers previous contexts according to their similarities to the current 
context. This requires an on-line right-to-left string matching process. Actually, 
the prefix list presented in this paper was initially suggested as an adaptive im- 
plementation of the context table, which was proposed as a common basis for 
representing text compression algorithms nni. The most straightforward appli- 
cation of the prefix list is the context-sorting text compression algorithm H3 
It virtually prepares a ranked list of all possible symbols, ordered from most 
likely to least likely, but actually gives ranks to symbol candidates by searching 
a sorted list of previous contexts, sorted in reverse lexicographic order. In Matias 
et al. P], in which the implementation of similar algorithms was referred to as 
the HYZ compression problem, the authors proposed to augment suffix trees to 
solve the problem. 

A prefix list represents every prefix in a string as a node in a doubly-linked 
linear list. It is similar to the suffix tree in that on-line incremental construction is 
possible, and to the suffix array in that lexicographic linear order is incorporated. 
It seems that we need $O(n^2)$ time to construct a prefix list from a text of length n.
However, if the text is an output from a finite-order Markov source, the expected
complexity is reduced to $O(n)$. For a pattern generated from the same source,
we can match it with the text in time linear in the pattern length.

In the next section, we define the prefix list and give an on-line procedure for
its construction. We show that we can build it from a Markovian text in linear
time. In Section 3, we slightly augment the prefix list to apply it to estimating
the entropy of an actual text. Section 4 briefly reviews the context-sorting text
compression algorithm, which motivated the development of the prefix list.
Section 5 is a survey of other possible applications.



2 Proposed Data Structure and Its Construction 

Let

$$S[1..n] = s_1 s_2 \cdots s_n \quad (s_i \in \Sigma,\ 1 \le i \le n) \tag{1}$$

be an $n$-symbol string over an ordered alphabet $\Sigma$ of size $|\Sigma|$. We
represent a substring $s_i \cdots s_j$ as $S[i..j]$ and define $S[i..j] = \varepsilon$,
the empty string, for $i > j$. The prefix of a string $S[1..n]$ that ends at
position $i$ is $S[1..i]$, and the suffix that begins at position $i$ is $S[i..n]$.
The $i$th symbol $s_i$ is also denoted by $S[i]$.

Based on the ordering relation on $\Sigma$, we can define its associated
lexicographic order on the set of all strings. Reverse lexicographic ordering is
lexicographic ordering of reversed strings. For example, the word ‘dog’
reverse-lexicographically (re-lexically, for short) precedes the word ‘deer’ since
‘god’ lexically precedes ‘reed’. Our new data structure maintains a re-lexically
sorted set of all the prefixes of $S[1..n]$. As an example, consider the string:



$$S[1..9] = \texttt{yabrecabr}, \tag{2}$$

in which the ten prefixes including the empty string are re-lexically sorted in the
order shown in Fig. 1.

    S[1..0] = ε
    S[1..7] = yabreca
    S[1..2] = ya
    S[1..8] = yabrecab
    S[1..3] = yab
    S[1..6] = yabrec
    S[1..5] = yabre
    S[1..9] = yabrecabr
    S[1..4] = yabr
    S[1..1] = y

Fig. 1. Re-lexically sorted list of prefixes of S[1..9] = ‘yabrecabr’.

Fig. 2. Inserting the next prefix. The upcoming symbol is compared with the
following symbols.
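The order of Fig. 1 can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name `relex_sorted_prefixes` is hypothetical, and an off-line sort stands in for the on-line data structure that the paper goes on to describe.

```python
def relex_sorted_prefixes(s):
    """Return the end positions i of the prefixes s[:i], sorted re-lexically.

    Position 0 stands for the empty prefix.
    """
    # Reverse lexicographic order is lexicographic order of the reversed strings.
    return sorted(range(len(s) + 1), key=lambda i: s[:i][::-1])


if __name__ == "__main__":
    s = "yabrecabr"
    for i in relex_sorted_prefixes(s):
        print(f"S[1..{i}] = {s[:i] or 'ε'}")
```

Because all prefixes have different lengths, no two sort keys are equal, so the order is unique.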

For a pair of two adjacent prefixes in a re-lexically sorted list of prefixes, if a
prefix $S[1..j]$ immediately follows $S[1..i]$, then the prefix $S[1..j]$ is said to
be the immediate successor of $S[1..i]$ and, conversely, $S[1..i]$ is said to be the
immediate predecessor of $S[1..j]$. We can insert an upcoming prefix into the
re-lexically sorted list of previously occurred prefixes in a simple way. Consider
again the example in (2), which may be followed by some symbol $s_{10}$. If the same
symbol as $s_{10}$ has not appeared so far, then the reverse lexicographic order of
$S[1..10]$ is determined only by its last symbol $s_{10}$. Otherwise, that is, if we
have already had $s_{10}$ in $S[1..9]$, then we can find the position of $S[1..10]$
by searching the so-far occurred symbols for the match with $s_{10}$. As shown in
Fig. 2, starting from the position of the current prefix $S[1..9]$, we search
bidirectionally the following symbols of the sorted prefixes for the same symbol as
$s_{10}$. If we hit a symbol $s_i = s_{10}$ in the ↑ direction, we should insert
$S[1..10]$ as the immediate successor of $S[1..i]$. Conversely, if we find the same
symbol as $s_{10}$ in the ↓ direction then we should insert $S[1..10]$ as the
immediate predecessor of $S[1..i]$. These can be validated by the recursive property
of reverse lexicographic order. As a specific example, suppose that we have
$s_{10}$ = ‘e’ in the example in Fig. 2. Then, we immediately reach $s_5$ = ‘e’ in
the ↓ direction. This implies that we should insert $S[1..10]$ = ‘yabrecabre’ as the
immediate predecessor of $S[1..5]$. Another case may have $s_{10}$ = ‘a’. In this
case, we hit either $s_7$ = ‘a’ in the ↑ direction or $s_2$ = ‘a’ in the ↓
direction. In either case, the position of $S[1..10]$ = ‘yabrecabra’ in the
re-lexically sorted list of these prefixes is known to be between $S[1..7]$ and
$S[1..2]$.
The prefix list is a natural realization of the above idea. It is implemented as a
doubly-linked linear list in which each element, or node, contains one integer and
three pointers. The three pointers are the pred and succ list pointers used to
organize the doubly-linked list and the next pointer used to designate the next
position in the input string. Every prefix in a string is represented by a node of
the list. If a node corresponds to the $i$th prefix $S[1..i]$, then its integer
field contains the value of position index $i$. Its pred and succ pointers point to
nodes corresponding to the immediate predecessor and immediate successor of
$S[1..i]$, respectively. The next pointer in the node for $S[1..i]$ points to the
node for $S[1..i+1]$. If a node corresponds to the entire string $S[1..n]$, then its
next pointer is set to nil. The initial state of a prefix list consists of a single
special node H, which represents the empty string. We may add an extra node T
into the end of a prefix list in order to simplify some list operations. If we
schematically represent a node pointed to by a pointer P as is shown in Fig. 3, in
which the left (←) and right (→) arrows represent the pred and succ pointers,
respectively, and the vertical arrow (↓) represents the next pointer, then our
sample string in (2) is represented by the list shown in Fig. 4.

Fig. 3. Node representing S[1..P↑.indx], with fields P↑.pred, P↑.indx, P↑.succ,
and P↑.next.

    indx:  0(H)  7    2    8    3    6    5    9    4    1
    next:  #9    #3   #4   #7   #8   #1   #5   nil  #6   #2

Fig. 4. Prefix list for ‘yabrecabr’. Nodes are shown in list order; each next
entry gives the list position (counted from 0 at H) of the node it designates.

As mentioned above, a prefix list can be constructed incrementally in an on-line
manner. Assume that the list representing all prefixes of an initial segment
$S[1..i]$ has been already constructed and that the $(i+1)$st prefix $S[1..i+1]$ is
about to be inserted. Let P be a pointer that points to the just-inserted node
for $S[1..i]$. If the upcoming symbol $s_{i+1}$ alphabetically precedes or succeeds
any symbol seen so far, then the node for $S[1..i+1]$ should be inserted into the
right of the list head (H) or the left of the list tail (T), respectively.
Otherwise, if the symbol $s_{i+1}$ is not included in $S[1..i]$, then the list has a
unique position Q where the corresponding node Q↑ satisfies

$$S[Q{\uparrow}.indx] \preceq s_{i+1} \preceq S[Q{\uparrow}.succ{\uparrow}.indx]. \tag{3}$$

Here, ‘$\preceq$’ denotes the alphabetic order on $\Sigma$. We should insert a new
node between the two nodes pointed to by Q and Q↑.succ.

If the same symbol as $s_{i+1}$ has already appeared in $S[1..i]$, the inequalities
in (3) may hold with equality. In this case, in the re-lexically sorted list of
prefixes of $S[1..i+1]$, the immediate predecessor or successor of $S[1..i+1]$ has
the same last symbol as $s_{i+1}$. If the immediate predecessor $S[1..j+1]$ of
$S[1..i+1]$ has the same last symbol $s_{j+1}$ as $s_{i+1}$ ($0 \le j < i$), then
$S[1..j]$ re-lexically precedes $S[1..i]$. The node corresponding to $S[1..j]$
should be the first node with the same following symbol $s_{j+1}$ as $s_{i+1}$ when
traversing the list from the current node to the head. We can see whether the
following symbol matches $s_{i+1}$ by traversing the next pointer. Conversely, if
the last symbol of the immediate successor $S[1..j+1]$ of $S[1..i+1]$ is equal to
$s_{i+1}$, then the node for $S[1..j]$ should be the first node satisfying
$s_{j+1} = s_{i+1}$ when traversing the list from the current node to the tail.
Thus, starting from the current node for $S[1..i]$, we search bidirectionally the
list for the node for $S[1..j]$ while comparing the following symbols with
$s_{i+1}$. Once we have found the node for $S[1..j]$, we can immediately reach the
node for $S[1..j+1]$ via the next pointer. Then, the position where we should
insert a new node representing $S[1..i+1]$ is adjacent to that node for
$S[1..j+1]$.

Fig. 5. Re-lexically sorted contexts and their following symbols.
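The construction procedure described above can be sketched in Python as follows. This is a minimal illustration rather than the author's implementation: the class and method names are hypothetical, a Python set records which symbols have been seen, and a never-seen symbol is placed by a plain linear scan over last symbols (an assumption made for brevity).

```python
class Node:
    """One prefix S[1..indx]; pred/succ link the sorted list, next follows the text."""
    __slots__ = ("indx", "pred", "succ", "next")

    def __init__(self, indx):
        self.indx = indx
        self.pred = self.succ = self.next = None


class PrefixList:
    def __init__(self):
        self.text = []                 # text[k - 1] holds s_k
        self.head = Node(0)            # H: the empty prefix
        self.tail = Node(None)         # T: sentinel that closes the list
        self.head.succ, self.tail.pred = self.tail, self.head
        self.current = self.head       # node of the most recently inserted prefix
        self.seen = set()              # symbols that have occurred so far

    def _follows(self, node, c):
        # True if the prefix held by `node` was followed by symbol c in the text.
        return node.next is not None and self.text[node.next.indx - 1] == c

    def _slot_for_new_symbol(self, c):
        # c never occurred: its position depends on the last symbol alone.
        node = self.head
        while node.succ is not self.tail and self.text[node.succ.indx - 1] < c:
            node = node.succ
        return node

    def _slot_by_search(self, c):
        # Alternate between the head and tail directions from the current node.
        up, down = self.current.pred, self.current.succ
        while up is not None or down is not self.tail:
            if up is not None:
                if self._follows(up, c):
                    return up.next          # new node goes right after up.next
                up = up.pred
            if down is not self.tail:
                if self._follows(down, c):
                    return down.next.pred   # new node goes right before down.next
                down = down.succ
        raise AssertionError("a symbol that was seen before must be found")

    def append(self, c):
        """Insert the prefix obtained by appending symbol c to the text."""
        left = self._slot_by_search(c) if c in self.seen else self._slot_for_new_symbol(c)
        node = Node(len(self.text) + 1)
        node.pred, node.succ = left, left.succ
        left.succ.pred = node
        left.succ = node
        self.current.next = node
        self.current = node
        self.text.append(c)
        self.seen.add(c)

    def order(self):
        """Position indexes in list order, starting with 0 for H."""
        out, node = [], self.head
        while node is not self.tail:
            out.append(node.indx)
            node = node.succ
        return out
```

Feeding the symbols of ‘yabrecabr’ one at a time to `append` and then calling `order()` yields the node sequence of the sample list.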

Now, we show that a prefix list can be constructed in linear time if the string
in question is drawn from a Markov source of finite order $m$. In such a string,
the $k$th symbol can be completely characterized by the conditional probabilities
$\{\Pr(s_k \mid S[k-m..k-1]) \mid s_k \in \Sigma,\ S[k-m..k-1] \in \Sigma^m\}$, where
$\Pr(s_k \mid S[k-m..k-1])$ is the conditional probability of $s_k$ given
$S[k-m..k-1]$. We say that the $k$th symbol $s_k$ occurs in the context
$S[k-m..k-1]$.

Suppose that, for sufficiently large $i$, we are about to insert the node
corresponding to $S[1..i+1]$. To do it, we search the list bidirectionally for the
match of $s_{i+1}$. We assume that the search is performed in both directions
alternately. Figure 5 shows that the symbol-comparisons with $s_{i+1}$ are done in
$x_1, x_2, x_3, \ldots$ order. We evaluate the number of symbol-comparisons in this
search. Let $K_s$ be the total number of symbols compared until we reach the match
$s_{i+1} = s$, and $C$ denote the longest common context of those symbols. Then,
the expected number of $K_s$ is estimated as

$$E\{K_s\} = \frac{1}{\Pr(s_{i+1} = s \mid C)}. \tag{4}$$

Conversely, for any context $C$, the expected number of tested symbols over all
possible upcoming symbols becomes

$$E\{K\} = \sum_{s \in \Sigma} \Pr(s_{i+1} = s \mid C) \cdot E\{K_s\}
         = |\{s \in \Sigma \mid \Pr(s \mid C) > 0\}|. \tag{5}$$
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Table 1. Number of symbol-comparisons required to insert each prefix. 



    FILE           size      number of   number of symbol-comparisons
                   (bytes)   distinct    maximum     average
                             symbols
    rand            500000      53           827       52.48
    paper1           53161      95         36481       42.24
    bib             111261      81         47442       29.69
    alice29.txt     152089      74         34520       24.53
    news            377109      98        200587       78.28
    plrabn12.txt    481861      81        164435       23.11
    book2           610856      96        459095       70.11
    book1           768771      82        419926       33.10
    obj2            246814     256        119013      124.55
    pic             513216     159        287924       46.95
    kennedy.xls    1029744     256        671660      124.36

Here,

$$\sigma = |\{a \in \Sigma \mid P(a) > 0\}| \tag{6}$$

denotes the number of symbols with non-zero probability. Therefore, the expected
time complexity of the construction of a prefix list is linear in the string
length with the coefficient of

$$\sum_{C} \Pr(C)\, \sigma_C \le \sigma, \tag{7}$$

where $\sigma_C$ is the number of symbols with non-zero probability in context $C$.
Note that the equality in (7) holds when a data string is drawn from a memoryless
source.
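The coefficient above can be checked on a toy source. The following sketch is illustrative only: the function names and the three-symbol first-order source are made up for the example, and exact rational arithmetic is used so that each term Pr(s | C) · 1/Pr(s | C) evaluates to exactly 1.

```python
from fractions import Fraction as F


def expected_comparisons(cond_probs):
    """E{K} for one context: sum over s of Pr(s|C) * E{K_s}, with E{K_s} = 1/Pr(s|C)."""
    return sum(p * (1 / p) for p in cond_probs.values() if p > 0)


def coefficient(context_probs, cond):
    """Sum over contexts C of Pr(C) * sigma_C."""
    return sum(context_probs[c] * expected_comparisons(cond[c]) for c in context_probs)


# Hypothetical first-order source over the alphabet {a, b, c}.
context_probs = {"a": F(1, 2), "b": F(1, 4), "c": F(1, 4)}
cond = {
    "a": {"a": F(1, 2), "b": F(1, 2)},               # sigma_C = 2
    "b": {"a": F(1, 3), "b": F(1, 3), "c": F(1, 3)},  # sigma_C = 3
    "c": {"a": F(1)},                                  # sigma_C = 1
}

if __name__ == "__main__":
    print(coefficient(context_probs, cond))  # weighted average of sigma_C
```

For a memoryless source every context has the same distribution, so each σ_C equals σ and the bound is met with equality.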

The above estimation is valid for Markovian data of finite order. In order to
validate it on actual data, we performed a simple measurement of the number of
symbol-comparisons on artificial and natural data. In actual situations, $\sigma$,
the number of symbols with non-zero probability, can be regarded as the number
of distinct symbols that occur in the string. Thus, we compare the number of
distinct symbols that actually occurred with the number of symbol-comparisons
required in the insertion of the prefixes. Table 1 shows some results
of the measurement.

The first file, “rand,” consists of lower- and upper-case letters and spaces,
53 distinct symbols in total. We assigned random but fixed probabilities to these
symbols, and generated a sample sequence of 500000 symbols. Thus, the file
can be thought of as a realization of a memoryless source. The other files come
from the Calgary [2] and Canterbury corpora, both of which are collected
as standard data for the evaluation of text compression algorithms. The “size”
column in Table 1 gives the length of each file. The number of distinct symbols
in each file, which corresponds to $\sigma$, is shown in the third column. The
fourth column represents the maximum number of symbol-comparisons required
in searching the list for each symbol when we insert prefixes into the list. The
last column is the average number of symbol-comparisons. It never exceeds the
corresponding number of distinct symbols in any file. On the file “rand,” both
numbers are almost the same, which is naturally expected on data
from a memoryless source.



3 On-Line Computation of the Shortest Unique Substrings with an Application
to Entropy Estimation

Figure 6 shows an extension of our data structure, where we add an auxiliary
quantity to each node which represents the length of the longest common suffix 
of the strings corresponding to the node and to its immediate successor. This 
quantity serves as a measure for context similarity between two contexts which 
are adjacent to each other in a re-lexically sorted list of contexts. In this section, 
we describe on-line computation of the quantities and its application to the 
estimation of the entropy of a data source. 

Let $S[1..j]$ denote the immediate successor of the $i$th prefix $S[1..i]$. Letting
$l_i$ be the maximum $l$ such that $S[i+1-l..i] = S[j+1-l..j]$, we add it to the
node corresponding to $S[1..i]$. Assume that the node has both immediate successor
$S[1..j]$ and immediate predecessor $S[1..k]$ just after inserting that node
($1 \le j < i$, $1 \le k < i$). This implies that, immediately before the insertion
of the node, the two nodes $S[1..k]$ and $S[1..j]$ are directly adjacent to each
other. Suppose that these two nodes have had $l_k$ and $l_j$, respectively, as shown
in Fig. 7. The state in Fig. 7 may be changed into a new one by the insertion of the
node for $S[1..i]$. As shown in Fig. 8, the value of $l_j$ remains unchanged while
the value of $l_k$ may increase to $l'_k$. These satisfy both $l_i \ge l_k$ and
$l'_k \ge l_k$. More specifically,

    If $l'_k > l_k$ then $l_i = l_k$, otherwise $l_i \ge l_k$;

    If $l_i > l_k$ then $l'_k = l_k$, otherwise $l'_k \ge l_k$.

Fig. 6. Context similarities with the immediate successors (lengths of the
common suffixes).

Fig. 7. A pair of adjacent nodes.

Fig. 8. After inserting the ith prefix.
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Fig. 9. On-line entropy estimation. The text is four Jane Austen’s novels, con- 
sisting of 2,364,200 letters. The final result, 1.694 [bits/byte], estimated from the 
entire text, is 0.055 bits better than the estimate of Kontoyiannis et al. 



We can use the above relations to minimize the number of actual comparisons
required to compute $l'_k$ and $l_i$.

Our first application of the prefix list is the estimation of entropy of actual
data. As is well known in information theory, the data compression limit of a
string is given by the entropy of its source. Of the methods for estimating entropy
from sample data, the ones most related to our method are the SWE
(Sliding-window Entropy) estimator and Grassberger’s estimator. Our estimate
$H_n$ from a string $S[1..n]$ is defined by

$$H_n = \frac{n \log n}{\sum_{i=1}^{n} L_i}, \tag{8}$$

where $L_i$ is the minimum $l$ such that a copy of the substring $S[i-l+1..i]$ does
not appear anywhere else in the string. Thus, $L_i$ represents the length of the
shortest unique substring ending at position $i$. For the immediate predecessor
$S[1..k]$ of $S[1..i]$, the value of $L_i$ can be calculated as

$$L_i = \max\{l_i, l_k\} + 1.$$

Combining $L_i$ with the equation (8), we can perform entropy estimation in an
on-line manner.
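The relation between the stored common-suffix lengths and the estimate can be illustrated with a short off-line sketch. It is not the on-line procedure of the paper: the function names are hypothetical, an ordinary sort replaces the prefix list, and the logarithm is taken to base 2 on the assumption that the estimate is expressed in bits.

```python
import math


def common_suffix_len(a, b):
    """Length of the longest common suffix of strings a and b."""
    n = 0
    while n < min(len(a), len(b)) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n


def shortest_unique_lengths(s):
    """Return [L_1, ..., L_n], using L_i = max(l_i, l_k) + 1 on the sorted prefixes."""
    n = len(s)
    order = sorted(range(n + 1), key=lambda i: s[:i][::-1])
    lengths = [0] * (n + 1)
    for pos, i in enumerate(order):
        if i == 0:
            continue  # the empty prefix carries no L value
        # l_k: common suffix with the immediate predecessor.
        l_k = common_suffix_len(s[:order[pos - 1]], s[:i]) if pos > 0 else 0
        # l_i: common suffix with the immediate successor, if there is one.
        l_i = common_suffix_len(s[:i], s[:order[pos + 1]]) if pos + 1 <= n else 0
        lengths[i] = max(l_i, l_k) + 1
    return lengths[1:]


def entropy_estimate(s):
    """H_n = n log n / sum of L_i, in bits per symbol (base-2 logarithm assumed)."""
    n = len(s)
    return n * math.log2(n) / sum(shortest_unique_lengths(s))
```

On the sample string ‘yabrecabr’, for instance, the shortest unique substring ending at position 9 is ‘cabr’, because ‘abr’ also ends at position 4.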

Figure 9 shows an example of entropy estimation, where the estimate $H_n$ is
plotted as a function of the input length $n$. The sample text is a concatenation
of four Jane Austen’s novels [3], which is the same as that used by Kontoyiannis
et al. Compared with their results, we see that our method is not only efficient
but also provides very good estimates. However, since the problem of estimating
the entropy of English text itself is not the main focus of this paper, we will
discuss our estimates elsewhere.



4 Implementing the Context-Sorting Text Compression 
Algorithm 

The context-sorting text compression algorithm is an on-line data compression
method, which can be regarded as a kind of symbol-ranking compressor.
It is important in that it connects the block-sorting compression method
mentioned in the next section with Lempel-Ziv-type dictionary-based methods.
Although the context-sorting compression algorithm is asymptotically optimal
for data from a finite-order Markov source, its existing implementation
is naive and quite slow. In our previous implementation, we maintained
re-lexically sorted contexts using a binary search tree. We had to limit the length
of context and to consume time proportional to that bounded length. These have
prevented us from introducing more sophisticated codes into the coding stage.

In the context-sorting method, we enumerate previous contexts in the order
of their similarities to the current context. Then, we give ranks to distinct
symbols in accordance with the orders of their contexts. The next symbol is encoded
as its actual rank. Figure 10 shows an example of giving ranks to symbol
candidates. In the figure, we assume that we have already encoded an initial segment
ending with ‘... to define’ and are going to encode its following symbol. In
this example, if the next symbol is ‘m’ then it is encoded as rank 0. The space ‘␣’
is encoded as rank 1, ‘d’ as rank 2, and so on. These ranks may be encoded by
a fixed static code or an adaptive code. However, since the original implementation
took much time in the ranking phase, it was difficult to use adaptive codes,
which are generally slower than static ones. In our new implementation, we can
combine an adaptive arithmetic code with the prefix list. We no longer need
to restrict the context length. The new implementation incorporating the prefix
list runs more than ten times faster than the previous one with the bounded
context length of 8 symbols.

    re-lexically sorted contexts   following symbol
    ... new me                     t  (hod)
    ... is ine                     d  (ible)
    ... combine                    d
    ... to define                  ?  (current symbol)
    ... refine                     m  (ent)
    ... affine                     ␣
    ... is line                    a  (r)
    ... compone                    n  (t)

Fig. 10. Ranked candidates in the context-sorting compression algorithm.
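The ranking step can be sketched as follows. This is a simplified illustration, not the author's coder: similarity is taken to be the length of the common suffix with the current context, ties are broken in favour of the more recent context (an assumption), and the function names are hypothetical. A quadratic scan stands in for the search on the prefix list.

```python
def common_suffix_len(a, b):
    """Length of the longest common suffix of strings a and b."""
    n = 0
    while n < min(len(a), len(b)) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n


def rank_candidates(history):
    """Return the distinct symbols seen so far, most similar context first.

    The previous context number j is history[:j]; it was followed by history[j].
    """
    positions = sorted(
        range(len(history)),
        key=lambda j: (-common_suffix_len(history[:j], history), -j),
    )
    ranked = []
    for j in positions:
        if history[j] not in ranked:
            ranked.append(history[j])
    return ranked


def encode_rank(history, symbol):
    """Rank of `symbol` among the candidates, or None if it has never occurred."""
    ranked = rank_candidates(history)
    return ranked.index(symbol) if symbol in ranked else None
```

With the history ‘yabrecabr’, the most similar previous context is ‘yabr’, which was followed by ‘e’, so ‘e’ receives rank 0.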

The context-sorting compression method motivated the development of the
prefix list. We may compare the prefix list and the context-sorting
compression method to two sides of the same coin. The essential component of
the latter method is the calculation of a symbol’s rank. Therefore, when we design
an improvement of the prefix list, we should consider not only its construction
speed but also the possibility of efficient ranking of symbol candidates.

Although the context-sorting compression method is basically a symbolwise
algorithm, it can be extended to Ziv-Lempel-type phrase-based compression
methods. The HYZ compression method mentioned in the introductory
section includes such an extension. Another example is the ACB algorithm of
Buyanovsky, whose primal version is essentially the same as LZ77. What
most distinguishes ACB from other LZ77 variants is its method of specifying the
position of the longest match. If the longest match in the previous text begins
with $x_k$ in Fig. 5, where the current phrase begins with $s_{i+1}$, then the value
of $k$ is encoded as the match position. Therefore, we can apply the prefix list to
the ACB algorithm to calculate the value of $k$.

5 Other Applications 

String matching (Finding the longest common suffix): The (exact) string
matching problem is to find a string called the pattern in a longer string called the
text. We interpret it as a problem of finding a position to insert the pattern into
the re-lexically sorted list of prefixes of the text. If we wish to find an occurrence
of ‘cabr’ in $S[1..9]$ given in (2), it is enough to find the reverse lexicographic
relationship:

$$S[1..5] = \texttt{yabre} \prec' \texttt{cabr} \prec' \texttt{yabrecabr} = S[1..9], \tag{9}$$

where $\prec'$ denotes reverse lexicographic order. In this case, we can immediately
know that the pattern in question appears as $S[6..9]$. If we have another pattern
‘rabr’, then a similar relation

$$S[1..9] = \texttt{yabrecabr} \prec' \texttt{rabr} \prec' \texttt{yabr} = S[1..4] \tag{10}$$

holds. This time, there is no exact matching; instead the longest common suffix
‘abr’ can be found.

Thus, the problem of finding an occurrence of pat of length $m$ in text of
length $n$ is conceptually the same as building a prefix list for

$$S[1..n+m+1] = \textit{text}\,\&\,\textit{pat}, \tag{11}$$

where the symbol $\& \notin \Sigma$ is a special delimiter that alphabetically
precedes any symbol in $\Sigma$. Of course, there is no need for the actual
insertion of prefixes ending in ‘&pat’. Actually, we first build a prefix list for
the text. In order to
expedite the matching process, we use an auxiliary array of pointers, which map 
symbols to nodes in the prefix list. The array element corresponding to a symbol 
s includes a pointer Q that points to the node such that 

$$S[Q{\uparrow}.pred{\uparrow}.indx] \prec s = S[Q{\uparrow}.indx]. \tag{12}$$

Namely, the pointer Q points to the leftmost node among nodes representing 
prefixes with the same last symbol s. This array of pointers can also be used in 
the course of constructing a prefix list in order to check whether the next symbol 
has appeared in the initial segment seen so far. If the array element corresponding 
to the first symbol of the pattern is a nil pointer, then we know that the pattern 
is not contained in the text. Otherwise, we proceed to a procedure similar to the 
insertion of the rest of the pattern. If the pattern is generated from the same 
Markov source as for the text, this procedure runs linearly in the pattern length. 
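The interpretation of matching as locating an insertion position can be illustrated with a small sketch. The names are hypothetical, and a sorted Python list with binary search stands in for the prefix list and its pointer array, so the running time differs from the procedure described above.

```python
import bisect


def common_prefix_len(a, b):
    """Length of the longest common prefix of strings a and b."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n


def longest_suffix_match(text, pat):
    """Return (length, end): the longest suffix of pat that ends at text[:end].

    A result with length == len(pat) is an exact occurrence text[end-length:end].
    """
    order = sorted(range(len(text) + 1), key=lambda i: text[:i][::-1])
    keys = [text[:i][::-1] for i in order]      # reversed prefixes, sorted
    query = pat[::-1]
    pos = bisect.bisect_left(keys, query)       # where the pattern would be inserted
    best = (0, 0)
    for k in (pos - 1, pos):                    # only the two neighbours matter
        if 0 <= k < len(keys):
            length = common_prefix_len(keys[k], query)
            if length > best[0]:
                best = (length, order[k])
    return best
```

For ‘cabr’ the neighbour S[1..9] shares all four symbols, which gives the exact occurrence S[6..9]; for ‘rabr’ only the suffix ‘abr’ is shared.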
Suffix array construction: The suffix array for a string $S[1..n]$ is an array
of the indexes from 1 to $n$, specifying the lexicographic order of the suffixes
of $S$.

A prefix list maintains all prefixes in reverse lexicographic order. Thus, the 
construction of a suffix array is straightforward if we apply our prefix list con- 
struction procedure to a reversed string. It is sufficient to sequentially copy the 
indexes of a prefix list into a suffix array by traversing the list via succ links. 
Obviously, the conversion of a prefix list to a suffix array can be done in linear 
time. 

In some applications of suffix arrays, information about the longest common
prefixes (lcps) plays an important role. The quantity $l_i$ of the $i$th node can
be used as the lcp of consecutive elements of a resulting suffix array.
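The correspondence between the prefixes of the reversed string and the suffixes of the original can be sketched as follows. The function name is hypothetical, and an ordinary sort stands in for traversing the succ links of a prefix list built on the reversed string.

```python
def suffix_array_via_reversed_prefixes(s):
    """Suffix array of s (1-based start positions) from the reversed string."""
    n = len(s)
    r = s[::-1]
    # Re-lexical order of the non-empty prefixes of r. The reversed prefix
    # r[:i][::-1] equals the suffix of s that starts at position n - i + 1.
    order = sorted(range(1, n + 1), key=lambda i: r[:i][::-1])
    return [n - i + 1 for i in order]


if __name__ == "__main__":
    print(suffix_array_via_reversed_prefixes("banana"))
```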
Block-sorting data compression: The block-sorting data compression algo- 
rithm of Burrows and Wheeler has received considerable attention in an- 
ticipation that it may outperform the Lempel-Ziv codes. Its operation begins 
with a special sort procedure, called the BW transform, which is followed by a 
sequential application of move-to-front heuristics and statistical encoding. 

We can apply the prefix list to performing the BW transform. In our terms,
the BW transform can be described as follows. Here, H and T denote the
pointers to the list head and the list tail, respectively.

    P ← H;
    while (P ≠ T) {
        if (P↑.next = nil) output(‘&’) else output(S[P↑.next↑.indx]);
        P ← P↑.succ;
    }

This is not identical with the original definition of the transform but is essentially
the same. The prefix list in Fig. 4 converts our sample string into

$$S'[1..10] = \texttt{ybbrrac\&ea}, \tag{13}$$

which in turn is encoded by a move-to-front coder. We are omitting the second
half of the algorithm; see [2] for more details.
The BW transform is reversible; we can reconstruct $S[1..n]$ from its
transformed string $S'[1..n+1]$. In order to explain the reverse transformation, we
add subscripts to indicate the position of each symbol in $S'[1..n+1]$. In the above
example, the input into the reverse transformation is
$S'[1..10]$ = ‘$y_1 b_2 b_3 r_4 r_5 a_6 c_7 \&_8 e_9 a_{10}$’. First, we sort
alphabetically the symbols in this string in a stable manner. Then, we have

$$S''[1..10] = \&_8\, a_6\, a_{10}\, b_2\, b_3\, c_7\, e_9\, r_4\, r_5\, y_1. \tag{14}$$

Second, we write

$$\pi(i) = j, \quad 1 \le i, j \le n+1 \tag{15}$$

for the symbol $S'[i]$ if it appears as the $j$th symbol in $S''[1..n+1]$. In
the present example, we have $\pi(8) = 1$, $\pi(6) = 2$, $\pi(10) = 3$, and so on.
Then, beginning with $S \leftarrow \varepsilon$ and $i \leftarrow 1$, we repeat

    S ← S · S'[i];
    i ← π(i)

or equivalently,

    i ← π(i);
    S ← S · S''[i]

$n$ times to recover the original string in $S$.
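Both directions of this variant of the transform can be sketched in Python. The code is illustrative only: the function names are hypothetical, an ordinary sort replaces the traversal of the prefix list, and the delimiter ‘&’ is assumed not to occur in the input.

```python
def forward_transform(s, delimiter="&"):
    """Output, for each prefix in re-lexical order, the symbol that follows it."""
    n = len(s)
    order = sorted(range(n + 1), key=lambda i: s[:i][::-1])
    # The whole string has no following symbol, so the delimiter is emitted.
    return "".join(s[i] if i < n else delimiter for i in order)


def inverse_transform(t, delimiter="&"):
    """Recover the original string from the transformed string t."""
    # Stable sort of positions by symbol; the delimiter precedes every symbol.
    by_symbol = sorted(range(len(t)), key=lambda i: (t[i] != delimiter, t[i]))
    pi = [0] * len(t)
    for j, i in enumerate(by_symbol):
        pi[i] = j                      # t[i] is the j-th symbol after sorting
    out, i = [], 0
    for _ in range(len(t) - 1):        # n steps for a string of length n
        out.append(t[i])
        i = pi[i]
    return "".join(out)


if __name__ == "__main__":
    t = forward_transform("yabrecabr")
    print(t, inverse_transform(t))
```

Python's `sorted` is stable, which is what the reverse transformation requires of the alphabetical sort.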

The (forward) BW transform is more demanding than the corresponding 
inverse transform. Our prefix list performs the forward transform in linear time 
at least on a string that is drawn from a finite-order Markov source. 



6 Conclusion 

We have presented a conceptually simple data structure, called the prefix list. It 
is a linked-list representation of prefixes of a string sorted in reverse lexicographic 
order. It is quite similar to the suffix array in that lexicographic linear order is 
incorporated. While the suffix array has an off-line nature, a prefix list can be 
built in an on-line manner. This yields its characteristic applications. The prefix 
list provides a powerful tool to a class of context-based symbol-ranking data 
compression algorithms. We have also shown that the prefix list is applicable to 
other interesting problems. 
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Abstract. We present a new indexing method for the approximate 
string matching problem. The method is based on a suffix tree combined
with a partitioning of the pattern. We analyze the resulting algorithm
and show that the retrieval time is $O(n^\lambda)$, for $0 < \lambda < 1$, whenever
$\alpha < 1 - e/\sqrt{\sigma}$, where $\alpha$ is the error level tolerated and
$\sigma$ is the alphabet size. We experimentally show that this index outperforms
by far all other algorithms for indexed approximate searching; these are also the
first experiments that compare the different existing schemes. We finally show how
this index can be implemented using much less space.



1 Introduction 

Approximate string matching is a recurrent problem in many branches of com- 
puter science, with applications to text searching, computational biology, pattern 
recognition, signal processing, etc. 

The problem is: given a long text of length $n$, and a (comparatively short)
pattern of length $m$, retrieve all the text segments (or “occurrences”) whose edit
distance to the pattern is at most $k$. The edit distance between two strings is
defined as the minimum number of character insertions, deletions and replacements
needed to make them equal. We define the “error level” as $\alpha = k/m$.

In the on-line version of the problem, the pattern can be preprocessed but
the text cannot. The classical solution uses dynamic programming and takes
$O(mn)$ time. A number of algorithms later improved this result. The lower bound
of the on-line problem (which has been both proved and reached) is
$O(n(k + \log_\sigma m)/m)$, which is of course $\Omega(n)$ for constant $m$.
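The classical dynamic-programming solution can be sketched as follows. This is a generic textbook formulation given for illustration, not code from the paper; the function name `approx_occurrences` is hypothetical. It reports the end positions of the text segments whose edit distance to the pattern is at most k, keeping one column of the matrix at a time.

```python
def approx_occurrences(text, pattern, k):
    """Return the 1-based end positions of occurrences with at most k errors."""
    m = len(pattern)
    # col[i] = minimum edit distance between pattern[:i] and a text segment
    # ending at the current text position (initially: the empty segment).
    col = list(range(m + 1))
    ends = []
    for j, t in enumerate(text, start=1):
        diag = col[0]          # value of col[i - 1] at the previous position
        col[0] = 0             # an occurrence may start at any text position
        for i in range(1, m + 1):
            cur = min(
                diag + (pattern[i - 1] != t),   # match or replacement
                col[i] + 1,                     # extra text character
                col[i - 1] + 1,                 # missing pattern character
            )
            diag, col[i] = col[i], cur
        if col[m] <= k:
            ends.append(j)
    return ends
```

Each text character costs O(m) work, which gives the O(mn) total mentioned above.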

If the text is large, even the fastest on-line algorithms are not practical, and
preprocessing the text becomes necessary. However, just a few years ago, indexing
text for approximate string matching was considered one of the main open
problems in this area. Despite some progress in recent years, the indexing
schemes for this problem are still rather immature.

There are two types of indexing mechanisms for approximate string match- 
ing, which we call “word-retrieving” and “sequence-retrieving” . Word retrieving 

* This work has been supported in part by Fondecyt grant 1-990627 and Fondef grant 
96-1064. 
indices are more oriented to natural language text and information
retrieval. They can retrieve every word whose edit distance to the pattern is at
most $k$. Hence, they are not able to recover from an error involving a separator,
such as recovering the word "flowers" from the misspelled text "flo wers"
or from "manyflowers", if we allow one error. These indices are more mature,
but their restriction can be unacceptable in some applications, especially where
there are no words (as in DNA) or in agglutinating languages such as Finnish
or German.

Our focus in this paper is sequence retrieving indices. Among these, we find 
two types of approaches. 

A first type is based on simulating a sequential algorithm, but running it on
the suffix tree or DAWG of the text instead of the text itself.
Since every different substring in the text is represented by a single node in the
tree or the DAWG, it is possible to avoid redoing the same work when the text
has repetitions. Those indices take $O(n)$ space and construction time, but their
construction is not optimized for secondary memory and is very inefficient in
this case (see, however, [15]). Moreover, the structure is very inefficient in space
requirements, since it takes 12 to 70 times the text size.

Several algorithms of this type traverse the fewest possible nodes in
the suffix tree (or in the DAWG). The idea is to traverse all the
different tree nodes that represent “viable prefixes”, which are text substrings
that can be prefixes of an approximate occurrence of the pattern.

In [17], a simplified version of the above technique was independently proposed,
consisting of a limited depth-first search (DFS) on the suffix tree. Since
every substring of the text (i.e. every potential occurrence) can be found from
the root of the suffix tree, it is sufficient to explore every path starting at the
root, descending by every branch up to the point where it can be seen that the
branch does not represent the beginning of an occurrence of the pattern. This
algorithm inspects more nodes than the previous ones, but it is simpler. For
instance, with an additional $O(\log n)$ time factor, the algorithm runs on suffix
arrays, which take 4 times the text size instead of 12. The behavior of this
algorithm has also been analyzed.

The second type of sequence-retrieving indices is based on adapting an on-line
filtering algorithm. The filters are based on matching substrings of the pattern
without errors, and checking for potential occurrences around those matches.
The index is used to quickly find those substrings, and is based on storing some
text $q$-grams (substrings of length $q$) and their positions in the text.

Different filtration indices differ mostly in how the text is
sampled (distance between consecutive text samples, whether they overlap or
not, etc.), in how the pattern is sampled, in how many matching samples are
needed to verify their neighborhood in the text, etc. Depending on this and on
$q$ they achieve different space-time tradeoffs. In general, filtration indices are
much smaller than suffix trees (1 to 10 times the text size), although they are
less tolerant to the error level $\alpha$. They can also be built in linear time.



^ Although some, like Glimpse, can match the pattern inside a text word.
One filtration index is somewhat special because it does not reduce the search to
exact but to approximate search of pattern pieces. To search for a pattern of
length $m \le q - k$, all the maximal strings with edit distance $\le k$ to the
pattern are generated and searched in the set of $q$-grams. Later, all the
occurrences are merged. Longer patterns are split in as many pieces as necessary to
make them short enough.

In this paper we present a hybrid indexing scheme for this problem. It uses
a suffix tree, where the pattern is partitioned in subpatterns which are searched
with fewer errors in the suffix tree. All the occurrences of the subpatterns are
later verified for a complete match. The goal is to balance between the cost to
search in the suffix tree (which grows with the size of the subpatterns) and the
cost to verify the potential occurrences (which grows when shorter patterns are
searched). This method is shown experimentally to be by far superior to all other
implemented proposals, and we show analytically that the average retrieval time
can be made $O(n^\lambda)$ with $0 < \lambda < 1$, where the exponent depends on
$\alpha$ and on $H_\sigma(\alpha)$, the base-$\sigma$ entropy function. This is
sublinear for $\alpha < 1 - e/\sqrt{\sigma}$. This limit on $\alpha$ probably
cannot be improved. We finally propose an alternative data structure to reduce
the space requirements of the suffix tree, with little time penalty.

2 Combining Suffix Trees and Pattern Partitioning

We present now our alternative proposal. The general idea is to partition the
pattern into pieces, search each piece in the suffix tree in the classical way, and
check all the positions found for a complete match. We first consider how to 
search a piece in the suffix tree and later address the pattern partitioning issue. 



2.1 DFS Using a Bit-Parallel Automaton 

Let us consider the existing algorithms to traverse the suffix tree. While
some of them minimize the number of nodes traversed, the depth-first search
(DFS) approach is simpler but inspects more nodes. We show that DFS, thanks to
its simplicity, can be adapted to use a node processing algorithm which is
faster than dynamic programming, namely our bit-parallel on-line algorithm. The
tradeoff is: we can explore fewer nodes at higher cost per node or more nodes at
lower cost per node. We show later experimentally that this last alternative is
much faster when the bit-parallel algorithm is used to process the nodes.

We recall that the idea of the DFS approach is a limited depth-first search on
the suffix tree, starting at the root and stopping when it can be seen that the
current text substring cannot start an approximate pattern occurrence. No text
occurrence can be missed because every text substring can be found starting
from the root.

More specifically, we compute the edit distance between the tree path and the
pattern, and if at some node the distance is $\le k$ we know that the text
substring represented by the node matches the pattern. We report all the leaves
of the suffix tree which descend from those nodes, since their text positions
start with the matching substring. On the other hand, when we can determine that
the edit distance cannot be as low as k, we abandon the path. This surely
happens at depth m + k + 1 but normally happens before.

^ Probably [24] would also fit well.
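The traversal just described can be sketched as follows. This is only an illustration under simplifying assumptions: it uses a naive suffix trie (quadratic space, so suitable for tiny texts only) instead of a real suffix tree, and it processes each node with plain dynamic programming rather than the bit-parallel automaton. The names `build_suffix_trie` and `dfs_search` are hypothetical, and the sketch assumes k < m.

```python
def build_suffix_trie(text):
    """Naive suffix trie; each node records the start positions of the
    suffixes that pass through it (the leaves below it)."""
    root = {"children": {}, "starts": []}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node["children"].setdefault(ch, {"children": {}, "starts": []})
            node["starts"].append(start)
    return root


def dfs_search(text, pattern, k):
    """Sorted text positions where a substring within edit distance k of
    the pattern starts (assumes k < len(pattern))."""
    root = build_suffix_trie(text)
    m = len(pattern)
    found = set()

    def visit(node, column):
        # column[i] = edit distance between pattern[:i] and the current path
        if column[m] <= k:
            found.update(node["starts"])  # every suffix below starts with a match
            return
        if min(column) > k:
            return  # the distance can no longer drop to k: abandon the path
        for ch, child in node["children"].items():
            new = [column[0] + 1]
            for i in range(1, m + 1):
                cost = 0 if pattern[i - 1] == ch else 1
                new.append(min(column[i - 1] + cost, column[i] + 1, new[i - 1] + 1))
            visit(child, new)

    visit(root, list(range(m + 1)))
    return sorted(found)
```

For instance, `dfs_search("abracadabra", "abra", 0)` reports the two exact occurrences, and with k = 1 it also reports the positions where "bra" and "dabra" start.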

We implement this traversal using our bit-parallel on-line algorithm instead of
dynamic programming. This algorithm uses bit parallelism to simulate a
non-deterministic finite automaton (NFA) that recognizes the approximate
pattern. We modify this automaton to compute edit distance (removing the
initial self-loop it has in the on-line version).

Figure 1 shows the automaton to recognize "patt" with k = 2 errors. Every
row denotes the number of errors seen. Every column represents matching a pat- 
tern prefix. Horizontal arrows represent matching a character (i.e. if the pattern 
and text characters match, we advance in the pattern and in the text). All the 
others increment the number of errors (move to the next row): vertical arrows 
insert a character in the pattern (we advance in the text but not in the pattern), 
solid diagonal arrows replace a character (we advance in the text and pattern), 
and dashed diagonal arrows delete a character of the pattern (they are empty 
transitions, since we advance in the pattern without advancing in the text). The 
automaton signals (the end of) a match whenever a rightmost state is active. If 
we do not care about the number of errors of the occurrences, we can consider 
final states those of the last full diagonal. 




Fig. 1. An NFA for approximate string matching, with one row of states each for
no errors, 1 error and 2 errors. Unlabeled transitions match any character.
Dotted lines enclose the states actually represented in our algorithm.



Initially, the active states at row i are at the columns from 0 to i, to
represent the deletion of the first i characters of the pattern. We do not need
in fact to represent the initial lower-left triangle, since if a substring
matches with initial insertions we will find (in another branch of the suffix
tree) a suffix of it which does not need the insertion^. On the other hand,
unlike the on-line algorithm, we need to represent the first full diagonal,
since now it will not be always active. We start the automaton with only this
first full diagonal active, and traverse the suffix tree path until the
automaton runs out of active states or the lower right state is activated.

The simulation of this automaton needs $(m - k + 1)(k + 2)$ bits. If we call
w the number of bits in the computer word, then when the previous number
is $\le w$ we can put all the states in a single computer word and work $O(1)$
per traversed node of the suffix tree. For longer patterns, the automaton is
split in many computer words, at a cost of $O(k(m - k)/w)$. For moderate-size
patterns this improves over dynamic programming, which costs $O(m)$ per suffix
tree node.
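To make the automaton concrete, the sketch below simulates it with bit operations. It is not the diagonal packing described above: as a simpler stand-in it keeps one bit vector per row (in the style of Wu and Manber), which costs O(k) word operations per character instead of O(1). What it does share with the text is the modified automaton: there is no initial self-loop, so it computes edit distance against the whole string read so far. All function names are hypothetical.

```python
def build_masks(pattern):
    """masks[c] has bit i set iff pattern[i] == c."""
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)
    return masks


def initial_rows(k, full):
    """Row d starts with columns 0..d active (first d pattern characters deleted)."""
    return [((1 << (d + 1)) - 1) & full for d in range(k + 1)]


def step(rows, char_mask, full):
    """Feed one text character; bit i of rows[d] means 'column i, d errors'."""
    new_rows = [((rows[0] & char_mask) << 1) & full]  # row 0: matches only
    for d in range(1, len(rows)):
        match = (rows[d] & char_mask) << 1   # horizontal arrow
        insert = rows[d - 1]                 # vertical arrow
        replace = rows[d - 1] << 1           # solid diagonal arrow
        delete = new_rows[d - 1] << 1        # dashed diagonal (empty transition)
        new_rows.append((match | insert | replace | delete) & full)
    return new_rows


def within_k(pattern, string, k):
    """True iff the edit distance between pattern and string is at most k."""
    m = len(pattern)
    full = (1 << (m + 1)) - 1
    masks = build_masks(pattern)
    rows = initial_rows(k, full)
    for ch in string:
        rows = step(rows, masks.get(ch, 0), full)
        if not any(rows):
            return False  # no active states: the path can be abandoned
    return any((r >> m) & 1 for r in rows)
```

In a suffix tree traversal, `step` would be applied once per character of the path, and the early exit on "no active states" corresponds to abandoning the branch.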

This bit-parallel variation is only possible because of the simplicity of the
traversal. For instance, the idea does not work on the more complex setup of
the algorithms that minimize the number of nodes traversed, since these need
some adaptations of the dynamic programming algorithm that are not easy to
parallelize. Note that this algorithm can be seen as a particular case of
automaton searching over a trie.



2.2 Partitioning the Pattern 

It is well known that the search cost using the suffix tree grows exponentially
with m and k, no matter which of the two techniques we use (optimum traversal
or DFS). Hence, we prefer that m and k are small numbers. We present in this
section a new technique based on partitioning the pattern, so that the pattern
is split into many sub-patterns which are searched in the suffix tree, and
their occurrences are directly verified in the text for a complete match. We
show in the experiments that this technique outperforms all the others.

This method is based on an existing pattern partitioning technique. The core of
the idea is that, if a pattern of length m occurs with k errors and we split the
pattern in j parts, then at least one part will appear with at most
$\lfloor k/j \rfloor$ errors inside the occurrence. In fact, the case
j = k + 1 is the basis for an existing algorithm and for the q-gram index.

The new algorithm follows. We evenly divide the pattern in j pieces (j is
unspecified by now). Then we search in the suffix tree the j pieces with
$\lfloor k/j \rfloor$ errors using the algorithm of Section 2.1. For each match
found ending at text position i we check the text area
$[i - m - k \,..\, i + m + k]$.
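The following sketch illustrates the partition-and-verify scheme. As an assumption made to keep it short and self-contained, the pieces are located with a sequential dynamic-programming scan (`sellers`, a hypothetical helper) instead of the suffix tree traversal of Section 2.1; only the way candidates are produced and verified follows the description above. The names `split_evenly` and `partitioned_search` are also hypothetical, and j is assumed not to exceed the pattern length.

```python
def sellers(pattern, text, k):
    """End positions (exclusive) of text substrings within edit distance k
    of the pattern, using the classical column-wise dynamic programming."""
    m = len(pattern)
    col = list(range(m + 1))
    ends = []
    for pos, ch in enumerate(text):
        prev_diag, col[0] = col[0], 0  # an occurrence may start anywhere
        for i in range(1, m + 1):
            cur = min(prev_diag + (pattern[i - 1] != ch), col[i] + 1, col[i - 1] + 1)
            prev_diag, col[i] = col[i], cur
        if col[m] <= k:
            ends.append(pos + 1)
    return ends


def split_evenly(pattern, j):
    """Divide the pattern into j contiguous pieces of nearly equal length."""
    m = len(pattern)
    bounds = [(i * m) // j for i in range(j + 1)]
    return [pattern[bounds[i]:bounds[i + 1]] for i in range(j)]


def partitioned_search(pattern, text, k, j):
    """Search each piece with floor(k/j) errors, then verify the text area
    around every piece occurrence for the complete pattern with k errors."""
    m = len(pattern)
    areas = set()
    for piece in split_evenly(pattern, j):
        for end in sellers(piece, text, k // j):  # stand-in for the index search
            areas.add((max(0, end - m - k), min(len(text), end + m + k)))
    matches = set()
    for lo, hi in areas:
        for end in sellers(pattern, text[lo:hi], k):
            matches.add(lo + end)
    return sorted(matches)
```

Whatever the value of j, the result should coincide with a direct search for the whole pattern; j only changes how the work is divided between piece searching and verification.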

The reason why this idea works better than a simple suffix tree traversal
with the complete pattern is that, since the search cost on the suffix tree is
exponential in m and k, it may be better to perform j searches of patterns of
length m/j with k/j errors. However, the larger j, the more text positions have
to be verified, and therefore the optimum is in between. In the next section we
find analytically the optimum j and the complexity of the search.

One of the closest approaches to this idea is Myers’ index, which collects all
the text q-grams (i.e. prunes the suffix tree at depth q), and given the pattern
it generates all the strings at distance at most k from it, searches them in the
index and merges the results. This is the same work as that of a suffix tree
provided that we do not enter too deep (i.e. $m + k \le q$). If $m + k > q$,
Myers’ approach splits the pattern and searches the subpatterns in the index,
checking all the potential occurrences. The main difference with our proposed
approach is that Myers’ index generates all the strings at a given distance and
searches them, instead of traversing the structure to see which of them exist.
This makes that approach degrade on biased texts, where most of the generated
q-grams do not exist (in the experimental section we show that it works well on
DNA but quite badly on English). Moreover, we split the pattern to optimize the
search cost, while the splitting in Myers’ index is forced by indexing
constraints (i.e. q).

^ If, after traversing a text substring s, a 1 finally exits from the
lower-left triangle, then a suffix of s will do the same without entering into
the triangle.



3 Analysis 

3.1 Searching One Piece 

An asymptotic analysis on the performance of a depth-first search over suffix
trees is immediate if we consider that we cannot go deeper than level m + k
since past that point the edit distance between the path and our pattern is
larger than k and we abandon the search. Therefore, we can spend at most
$O(\sigma^{m+k})$ time, which is independent of n and hence $O(1)$. Another way
to see this is to use an existing analysis of the problem of searching an
arbitrary automaton over a suffix trie. The result for this case indicates
constant time (i.e. depending on the size of the automaton only) because the
automaton has no cycles.

However, we are interested in a more detailed average analysis, especially the
case where n is not so large in comparison to $\sigma^{m+k}$. We start by
analyzing which is the average number of nodes at level $\ell$ in the suffix
tree of the text, for small $\ell$. Since almost all suffixes of the text are
longer than $\ell$ (i.e. all except the last $\ell$), we have nearly n suffixes
that reach that level. The total number of nodes at level $\ell$ is the number
of different suffixes once they are pruned at $\ell$ characters. This is the
same as the number of different $\ell$-grams in the text. If the text is random,
then we can use a model where n balls are thrown into $\sigma^\ell$ urns, to
find out that the average number of filled urns (i.e. suffix tree nodes at
level $\ell$) is

$$\sigma^\ell \left(1 - (1 - 1/\sigma^\ell)^n\right) = \Theta\left(\sigma^\ell (1 - e^{-n/\sigma^\ell})\right) = \Theta(\min(n, \sigma^\ell))$$

which shows that the average case is close to the worst case: up to level
$\log_\sigma n$ all the possible nodes exist, while for deeper levels all the n
nodes exist.
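The urn model is easy to evaluate numerically. The sketch below computes the expected number of occupied urns when n balls are thrown uniformly into $\sigma^\ell$ urns, which is how the average number of suffix tree nodes at level $\ell$ is modeled here for a random text. The function name `expected_nodes` is hypothetical, and `expm1`/`log1p` are used only to keep the result accurate when the number of urns is large.

```python
import math


def expected_nodes(n, sigma, level):
    """Expected number of distinct strings of length `level` among n random
    ones over an alphabet of size sigma: urns * (1 - (1 - 1/urns)**n)."""
    urns = sigma ** level
    if urns == 1:
        return 1.0 if n > 0 else 0.0
    # (1 - 1/urns)**n = exp(n * log1p(-1/urns)), evaluated stably
    return -urns * math.expm1(n * math.log1p(-1.0 / urns))
```

With n = 1000 and $\sigma = 4$, level 2 gives essentially all 16 possible nodes, while level 10 (about a million urns) gives a value just below n, in agreement with the $\Theta(\min(n, \sigma^\ell))$ behaviour.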

We also need the probability of processing a given node at depth $\ell$ in the
suffix tree. In the Appendix we prove that the probability is very high for
$\beta = k/\ell \ge 1 - c/\sqrt{\sigma}$, and otherwise it is
$O(\gamma(\beta)^\ell)$, where $\gamma(\beta) < 1$. The constant c can be proven
to be smaller than e = 2.718..., and is empirically known to be close to 1. The
$\gamma(x)$ function is

$$\gamma(x) = \frac{1}{\sigma^{1-x}\, x^{2x}\, (1-x)^{2(1-x)}}$$

which goes from $1/\sigma$ to 1 as x goes from 0 to $1 - c/\sqrt{\sigma}$.

Therefore, we pessimistically consider that in levels

$$\ell \le L(k) = \frac{k}{1 - c/\sqrt{\sigma}} = O(k)$$

all the nodes in the suffix tree are visited, while nodes at level
$\ell > L(k)$ are visited with probability $O(\gamma(k/\ell)^\ell)$, where
$\gamma(k/\ell) < 1$. Finally, we never work past level m + k. We are left with
three disjoint cases to analyze, illustrated in Figure 2.




Fig. 2. The upper left figure shows the visited parts of the tree. The rest shows 
the three disjoint cases in which the analysis is split. 



(a) $L(k) \ge \log_\sigma n$, i.e. $n \le \sigma^{L(k)}$, or “small n”

In this case, since on average we work on all the nodes up to level
$\log_\sigma n$, the total work is n, i.e. the amount of work is proportional
to the text size. This shows that the index simply does not work for very small
texts, an on-line search being preferable, as expected.

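(b) $m + k < \log_\sigma n$, i.e. $n > \sigma^{m+k}$, or “large n”

In this case we traverse all the nodes up to level $L(k)$, and from there on we
work at level $\ell$ with probability $\gamma(k/\ell)^\ell$, until
$\ell = m + k$. Under case (b), there are $\sigma^\ell$ nodes at level $\ell$.
Hence the total number of nodes traversed is

$$\sum_{\ell=0}^{L(k)} \sigma^\ell \;+\; \sum_{\ell=L(k)+1}^{m+k} \gamma(k/\ell)^\ell \sigma^\ell$$

where the first term is $O(\sigma^{L(k)})$. For the second term, we see that
$\gamma(x) > 1/\sigma$, and hence $(\gamma(k/\ell)\sigma)^\ell > 1$. More
precisely,

$$(\gamma(k/\ell)\sigma)^\ell = \frac{\sigma^k\, \ell^{2\ell}}{k^{2k} (\ell-k)^{2(\ell-k)}}$$

which grows as a function of $\ell$. Since $(\gamma(k/\ell)\sigma)^\ell > 1$,
we have that even if it were constant with $\ell$, the last term would dominate
the summation. Hence, the total cost in case (b) is

$$\gamma(k/(m+k))^{m+k}\, \sigma^{m+k} = \frac{\sigma^k (m+k)^{2(m+k)}}{m^{2m} k^{2k}}$$

which is independent of n.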

(c) $L(k) < \log_\sigma n \le m + k$, i.e. “intermediate n”

In this case, we work on all nodes up to $L(k)$ and on some nodes up to m + k.
The formula for the number of visited nodes is

$$\sum_{\ell=0}^{L(k)} \sigma^\ell \;+\; \sum_{\ell=L(k)+1}^{\log_\sigma(n)-1} \gamma(k/\ell)^\ell \sigma^\ell \;+\; \sum_{\ell=\log_\sigma n}^{m+k} \gamma(k/\ell)^\ell\, n$$

The first sum is $O(\sigma^{L(k)})$. For the second sum, we know already that
the last term dominates the complexity (see case (b)). Finally, for the third
sum we have that $\gamma(k/\ell)$ decreases as $\ell$ grows, and therefore the
first term dominates the rest (which would happen even for a constant
$\gamma$).

Hence, the case $\ell = \log_\sigma n$ dominates the last two sums. This term is

$$n\, \gamma(k/\log_\sigma n)^{\log_\sigma n} = \frac{\sigma^k (\log_\sigma n)^{2 \log_\sigma n}}{k^{2k} (\log_\sigma(n) - k)^{2(\log_\sigma(n) - k)}} = \frac{\sigma^k (\log_\sigma n)^{2k}}{k^{2k}} (1 + O(1))$$

(this can be bounded by $(\sigma (1 + 1/\alpha)^2)^k$ by noticing that we are
inside case (c), but we are interested in how n affects the growth of the cost).



The search time is then sublinear for $\log_\sigma n > \max(L(k), m + k)$, or
which is the same,
$\alpha < \max(\log_\sigma(n)/m \, (1 - c/\sqrt{\sigma}),\ \log_\sigma(n)/m - 1)$.
Figure 3 illustrates.



3.2 Pattern Partitioning 

When pattern partitioning is applied, we perform j searches of the same kind as
in Section 3.1, this time with patterns of length m/j and k/j errors. We also
need to verify all the possible matches.

Fig. 3. Area of sublinearity for suffix tree traversal.



As shown in previous work, the matching probability for a text position is
$O(\gamma(\alpha)^m)$, where $\gamma(\alpha)$ is the function defined in
Section 3.1. From now on we use $\gamma = \gamma(\alpha)$. Using dynamic
programming, a verification costs $O(m^2)$ ^. Hence, our total search cost is

$$j \times \mathit{suffix\_tree\_traversal}(m/j, k/j) \;+\; j \times \gamma^{m/j} m^2 n$$

and we want the optimum j. First, notice that if $\gamma = 1$ (that is,
$\alpha \ge 1 - c/\sqrt{\sigma}$), the verification cost is as high as an
on-line search and therefore pattern partitioning is useless. In this case it
may be better to use plain DFS. In the analysis that follows, we assume that
$\gamma < 1$ and hence $\alpha < 1 - c/\sqrt{\sigma}$.

According to Section 3.1 we divide the analysis in three cases. Notice that
now we can adjust j to select the best case for us.

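(a) $\sigma^{L(k/j)} \ge n$, or $j \log_\sigma n \le k/(1 - c/\sqrt{\sigma})$

In this case the search cost is $\Omega(n)$ and the index is of no use.

(b) $\sigma^{(m+k)/j} < n$, or $j \log_\sigma n > m + k$

In this case the total search cost is

$$j\, \sigma^{L(k/j)} \;+\; j \left( \frac{\sigma^k (m+k)^{2(m+k)}}{m^{2m} k^{2k}} \right)^{1/j} +\; j\, \gamma^{m/j} m^2 n$$

where the first two terms decrease and the last one increases with j. Since
$a + b = \Theta(\max(a, b))$, the minimum order is achieved when increasing and
decreasing terms meet. When equating the first and third terms we obtain that
the optimum j is

$$j_1 = \frac{m}{\log_\sigma(m^2 n)} \left( \frac{\alpha}{1 - c/\sqrt{\sigma}} + \log_\sigma(1/\gamma) \right)$$

and the complexity (only considering n) is
$O\left(n^{\alpha / (\alpha + (1 - c/\sqrt{\sigma}) \log_\sigma(1/\gamma))}\right)$.

^ It can be done in $O((m/j)^2)$ time, but this does not affect the result here.

On the other hand, if we equate the second and third term, the best j is

$$j_2 = \frac{m}{\log_\sigma(m^2 n)} \left( 1 + 2((1+\alpha) \log_\sigma(1+\alpha) + (1-\alpha) \log_\sigma(1-\alpha)) \right)$$

and the complexity is
$O\left(n^{1 - \log_\sigma(1/\gamma) / (1 + 2((1+\alpha) \log_\sigma(1+\alpha) + (1-\alpha) \log_\sigma(1-\alpha)))}\right)$.

In any case, we are able to achieve a sublinear complexity of $O(n^\lambda)$,
where

$$\lambda = \max\left( \frac{\alpha}{\alpha + (1 - c/\sqrt{\sigma}) \log_\sigma(1/\gamma)},\ 1 - \frac{\log_\sigma(1/\gamma)}{1 + 2((1+\alpha) \log_\sigma(1+\alpha) + (1-\alpha) \log_\sigma(1-\alpha))} \right)$$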

Which of the two complexities dominates yields a rather complex condition
that depends on the error level $\alpha$, but in both cases $\lambda < 1$ if
$\alpha < 1 - c/\sqrt{\sigma}$. If $\sigma$ is large enough ($\sigma > 24$ for
c = e), the complexity corresponding to $j_2$ always dominates. However, it is
possible that $j_1$ or $j_2$ are outside the bounds of case (b) (i.e. they are
too small). In this case we would use the minimum possible
$j = (m + k)/\log_\sigma n$, and the third term would dominate the cost, for an
overall complexity of $O\left(n^{1 - \log_\sigma(1/\gamma)/(1+\alpha)}\right)$.
This complexity is also sublinear if $\alpha < 1 - c/\sqrt{\sigma}$.

(c) $\sigma^{L(k/j)} < n \le \sigma^{(m+k)/j}$, or $k/(1 - c/\sqrt{\sigma}) < j \log_\sigma n \le m + k$

The search cost in this intermediate case is 

where the first two terms decrease with j and the last one increases. Repeating
the same process as before, we find that the first and third term meet
again at $j = j_1$ with the same complexity. We could not solve exactly where
the second and third term meet. We found

m{a + 2a log^ log^ n + log^ ^ - 2a log^ ^ TO(a -flog^ ;^) 
log^(m'^n) log^(TO^n) 

and since the solution is approximate, the terms are not exactly equal at $j_3$.
The second term is slightly higher than the third. Again, it is possible that
$j_3$ is out of the bounds of case (c) and we have to use the same limiting
value as before.

The conclusion is that, although the exact formulation is complex, we have
sublinear complexity for $\alpha < 1 - c/\sqrt{\sigma}$, as well as formulas
for the optimum j to use, which is $O(m/\log_\sigma n)$ with a complicated
constant.

For larger $\alpha$ values the pattern partitioning method gives linear
complexity and we need to resort to the traditional suffix tree traversal
(j = 1). As previously shown, it is very unlikely that this limit of
$1 - c/\sqrt{\sigma}$ can be improved, since there are too many real
approximate occurrences in the text.

An interesting fact that is shown in the experiments is that in many cases the
optima are out of bounds and hence the best is to put j in the limit of cases
(b) and (c), just where the searches of the subpieces become full searches.
This shows that a technique that is simple and the best choice in most cases is
to select $j = (m + k)/\log_\sigma n$, for a complexity of

$$O\left(n^{1 - \log_\sigma(1/\gamma)/(1+\alpha)}\right) = O\left(n^{2(\alpha + H_\sigma(\alpha))/(1+\alpha)}\right)$$

where $H_\sigma(\alpha) = -\alpha \log_\sigma \alpha - (1 - \alpha) \log_\sigma(1 - \alpha)$
is the base-$\sigma$ entropy function.
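As a numeric illustration, the sketch below evaluates the simple choice of j and the resulting exponent of n. It assumes the exponent has the form $2(\alpha + H_\sigma(\alpha))/(1+\alpha)$ and that $\gamma(\alpha) = 1/(\sigma^{1-\alpha} \alpha^{2\alpha} (1-\alpha)^{2(1-\alpha)})$; both are assumptions of this sketch, and all function names are hypothetical. Under these assumptions the computed exponents agree with the analytical exponents of n reported with Fig. 6 (for example 0.608 for DNA with m = 20 and k = 2).

```python
import math


def entropy(sigma, alpha):
    """Base-sigma entropy H_sigma(alpha), for 0 < alpha < 1."""
    return -(alpha * math.log(alpha, sigma)
             + (1 - alpha) * math.log(1 - alpha, sigma))


def gamma(sigma, alpha):
    """Assumed form of the matching-probability base gamma(alpha)."""
    return 1.0 / (sigma ** (1 - alpha)
                  * alpha ** (2 * alpha)
                  * (1 - alpha) ** (2 * (1 - alpha)))


def exponent(sigma, alpha):
    """Exponent of n when j = (m + k) / log_sigma(n) is used."""
    return 2 * (alpha + entropy(sigma, alpha)) / (1 + alpha)


def simple_j(m, k, n, sigma):
    """The simple choice of j, before rounding to an integer."""
    return (m + k) / math.log(n, sigma)
```

For a text of about 1.34 million characters with $\sigma = 4$, m = 20 and k = 2, `simple_j` gives a value slightly above 2, which is then rounded to an integer number of pieces.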



3.3 The Limits of the Method 

Let us pay some attention to the limits of our hybrid method (Figure 4).

Since $j = O(m/\log_\sigma n)$, the best j becomes 1 (i.e. no pattern partitioning)
when n > (this is because the cost of verifications dominates over suffix 

tree traversal). The best j is > k+1 for n < Since in this case we search 

the pieces with zero errors (i.e. $\lfloor k/(k+1) \rfloor = 0$, recall Section 2.2), the search
in the suffix tree costs $O(m)$, and later we have to verify all their occurrences.
This is basically what the q-gram index does, except it prunes the suffix
tree at depth q.

Finally, the only case where the index is not useful is when n is very small. 
We can increase j to be more resistant to small texts, but the limit is j = k + 1,
and using that j the index ceases to be useful for n < < cr^/“. We have 

also to keep sublinear the cost of verifications, i.e. = o(l), which happens 

for $\alpha < 1/\log_{1/\gamma} n$. This requires, in particular, that $m = \Omega(\log n)$.



Fig. 4. The j values to be used according to n. As n grows, the regions are:
nothing useful; maximal j (j = k+1); intermediate j (hybrid index); and j = 1
(no partitioning).



This last consideration helps also to understand how it is possible to have a
sublinear-time index based on filtering when there is a fixed matching
probability per text position ($\gamma^m$), and therefore the verification cost
must be $\Omega(n)$. The trick is that in fact we assume $m = \Omega(\log n)$,
that is, we have to search longer patterns as the text grows. As we can tune j,
we softly move to j = 1 (then eliminating verification costs) when n becomes
large with respect to m. This “trick” is also present in the sublinearity
result of Myers’ index and implicit in similar results on natural language
texts.

4 Experimental Results 

We first validate some of the analytical results of the paper and later compare
our indices against the other existing proposals. We used two different texts:

- DNA text (“h. influenzae”), a 1.34 Mb file. This file is called DNA in our
  tests, and H-DNA is the first half megabyte of it. In this case $\sigma = 4$.

- English literary text (from B. Franklin), filtered to lower-case and with the
  separators converted into a single space. This text has 1.26 Mb, and is
  called FRA in the experiments. H-FRA is the first half megabyte of FRA. Given
  how the analysis uses the $\sigma$ value, it is unrealistic to set it to the
  alphabet size, because the text is biased. It is much better to consider that
  $1/\sigma$ must be the probability that two random letters are equal. This
  sets $\sigma = 12.85$.
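The effective alphabet size described in the last item can be estimated directly from symbol frequencies. The sketch below assumes that "two random letters" means two symbols drawn independently according to their empirical frequencies in the text; the function name `effective_sigma` is hypothetical.

```python
from collections import Counter


def effective_sigma(text):
    """Inverse of the probability that two symbols drawn independently from
    the text's empirical symbol distribution are equal."""
    counts = Counter(text)
    total = len(text)
    p_equal = sum((c / total) ** 2 for c in counts.values())
    return 1.0 / p_equal
```

On a uniform text the value equals the alphabet size (4 for balanced DNA), and it drops below the alphabet size as the distribution becomes more skewed, which is the situation described for the English text.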

The texts are rather small, in some cases too small to appreciate the speedup 
obtained with some indices. This is because we had RAM problems to build 
suffix trees for larger texts. However, the experiments still serve to obtain basic 
performance numbers on the different indices. 

We have tested short and medium-size patterns, searching with 1, 2 and 3 
errors the short ones and with 2, 4 and 6 the medium ones. The short patterns 
were of length 10 for DNA and 8 for English, and the medium ones were of length 
20 and 16, respectively^. We selected 1000 random patterns from each file and
used the same set for all the k values of that length, and for all the indices.



4.1 Validating the Analysis 

We first show that the suffix tree traversal has sublinear complexity. We built
the suffix tree of incremental prefixes of FRA and DNA, from 100 Kb to 800 Kb
(larger texts start to give I/O problems that disturb the CPU measures).
According to our analysis, the m, k and $\sigma$ values used correspond to
intermediate text sizes (case (c) of Section 3.1) for n = 4 Kb to 4 Mb on DNA
and for n = 40 Kb to 8 Gb on FRA. Hence, we are clearly in case (c) in all our
experiments. The analysis predicts a complexity of $O((\log n)^{2k})$.

Figure 5 shows the user time as n grows, from where the sublinearity is clear.
We have used least squares with the model $t = a \ln(n)^b$ to find out the
empirical complexity and present it compared to the analytical complexity. The
error of the approximation is always below 5%. We see that the analysis
approximates reasonably the empirical results, despite the many simplifications
done.
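One way to fit the model $t = a \ln(n)^b$ by least squares is to take logarithms, which turns it into a straight line in $\ln \ln n$. The sketch below does that; whether the fit in the experiments was done on the log-transformed data or directly is not stated, so this is an assumption of the sketch, and `fit_log_power` is a hypothetical name.

```python
import math


def fit_log_power(sizes, times):
    """Least-squares fit of ln(t) = ln(a) + b * ln(ln(n)); returns (a, b).
    Requires every n > e and every t > 0."""
    xs = [math.log(math.log(n)) for n in sizes]
    ys = [math.log(t) for t in times]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b
```

The returned b plays the role of the empirical exponent of log n that is compared against the analytical value 2k.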

We consider now the optimal j value for pattern partitioning. Table 1 presents
the query time using different j values in our index, for the FRA, H-FRA, DNA,
and H-DNA texts. As can be seen, there are big differences in time depending on
j, and the optimum is a rather small j value (always 1 on short patterns). This
matches reasonably our formulas. In fact, once properly rounded, our analysis
recommends the correct j values. As mentioned before, the relevant value is
always in the limit between cases (b) and (c).

Figure 6 shows the user time for long patterns, as n grows, using pattern
partitioning with j = 2. This time we have used least squares with the model
$t = a n^b$. The error of the approximation is always below 2%. It can be seen
that also in this case the analysis approximates reasonably the empirical
results, slightly overestimating in most cases. The combination DNA (20,6) is
not included because it takes too long and already the case (20,4) was clearly
linear.

^ This is because of the restrictions of Myers’ index intersected with our
interest in moderate-length patterns.

Text (m, k)    experiment   analysis
DNA (10,1)     2.83         2
DNA (10,2)     3.78         4
DNA (10,3)     3.79         6
FRA (8,1)      2.57         2
FRA (8,2)      3.53         4
FRA (8,3)      4.17         6

Fig. 5. User query time (in milliseconds) for short patterns as n grows for
k = 1 to 3, using j = 1. The table gives the empirical and analytical exponent
of log n.



4.2 Comparison Against Others 

We compare our index with the other existing proposals. However, as the task 
to program an index is rather heavy, we have only considered the other indices 
that are already implemented. The indices included in this comparison are 

Myers’: The index proposed by Myers. We use the implementation of the author,
  which works for some m values only (that depend on $\sigma$ and n).
Cobbs’: The index proposed by Cobbs. We use the implementation of the author,
  not optimized for space. The code is restricted to work on an alphabet of
  size 4 or less, so it is only built on DNA.

Text    j   (10,1)   (10,2)   (10,3)   (20,2)   (20,4)   (20,6)
DNA     1   6.81     134.4    1044     56.81    1989     10075
        2   2391     2585     2756     15.80    1033     9525
        3   -        -        -        802.8    1010     8841
        4   -        -        -        -        5862     39077
H-DNA   1   2.71     44.29    394.6    23.72    499.9    2308
        2   645.0    715.3    860.4    6.01     305.2    2482
        3   -        -        -        232.7    305.9    2464
        4   -        -        -        -        1520     10339

Text    j   (8,1)    (8,2)    (8,3)    (16,2)   (16,4)   (16,6)
FRA     1   6.11     42.82    215.2    35.98    482.9    2204
        2   180.6    1754     19600    13.30    88.22    464.0
        3   -        -        -        90.71    736.6    4718
H-FRA   1   2.68     14.28    60.91    13.39    126.4    542.4
        2   61.43    601.1    4920     5.30     30.70    146.4
        3   -        -        -        32.72    255.5    1538

Table 1. User query time (in milliseconds) for different (m, k) values (heading
rows). For each text we show the cost for different j values, one row per j.
The optimum is in boldface.



Samples(q): Our index based on q-grams. We show the results for q = 4 to 6.

Dfs(a/p): Our new index based on suffix trees. We show the results for the base
  technique (a) and pattern partitioning (p) with optimal j.

In particular, approximate searching on other q-gram indices is not yet
implemented and therefore is excluded from our tests. We know, however, that
their space requirements are low (close to a word-retrieving index), but also
that since the index simulates the on-line algorithm, its tolerance to errors
is quite low.

All the indices were set to show the matches they found, in order to put them
in a reasonably real scenario. We present the time to build the indices and the
space they take in Table 2.

The first clear result of the experiment is that the space usage of the indices
is very high. In particular, the indices based on suffix trees (Dfs and Cobbs’)
take 35 to 65 times the text size. This rules them out except for very small
texts (for instance, building Cobbs’ index on 1.34 Mb took 12 hours of real time
in our machine of 64 Mb of RAM). Among the other indices, Myers’ took 7-9 times
the text size, which is much better but still too much in practice. The best
option in terms of space is the Samples index, which takes from 1.5 to 7 times
the text size, depending on q and $\sigma$. The larger q or $\sigma$, the larger
the index. Samples(5), which takes 2-5 times the text size, performs well at
query time.

Compared to its size, Myers’ index was built very quickly. The Dfs index, on
the other hand, was built faster than Cobbs’. Notice that suffix trees are built
quickly when they fit in RAM (as in the half-megabyte texts), but for larger
texts the construction time is dominated by the I/O, and it increases sharply.

Text (m, k)    experiment   analysis
DNA (20,2)     0.547        0.608
DNA (20,4)     1.009        0.935
FRA (16,2)     0.470        0.485
FRA (16,4)     0.624        0.752
FRA (16,6)     0.753        0.922

Fig. 6. User query time (in milliseconds) for medium patterns as n grows for
k = 2 to 6, using j = 2. The table gives the empirical and analytical exponent
of n.

We consider now query time. Tables 3 and 4 present a comparison between
the different indices, using for Dfs(p) the optimum j value of Table 1 (only for
medium patterns, since for short ones Dfs(a) is always better). The system time
is included because it is dominant in many cases. We include also the time of
on-line searching for comparison purposes (we use the fastest on-line algorithm
for each case). The results clearly show a number of facts.

- The indices work well only for moderate error levels. For larger texts the
  ratio indexed/on-line should improve. However, when I/O time is considered
  many indices seem useless, and it is not so clear that this improves for
  larger texts. This depends on the amount of main memory available, and is a
  consequence of most indices not being designed to work on secondary memory.
  This is a very important issue that has been rarely addressed.

Index        DNA                 H-DNA               FRA                 H-FRA
Myers’       5.84u+0.35s         2.08u+0.12s         5.22u+0.34s         2.01u+0.12s
             10.68 Mb (7.97X)    4.50 Mb (9.00X)     9.39 Mb (7.46X)     4.18 Mb (8.35X)
Samples(4)   5.53u+0.19s         1.95u+0.10s         15.05u+0.41s        5.90u+0.24s
             2.04 Mb (1.52X)     0.77 Mb (1.53X)     3.48 Mb (2.77X)     1.48 Mb (2.98X)
Samples(5)   7.37u+0.24s         2.62u+0.08s         20.82u+0.70s        8.70u+0.35s
             2.48 Mb (1.85X)     0.94 Mb (1.87X)     5.18 Mb (4.11X)     2.32 Mb (4.65X)
Samples(6)   10.53u+0.32s        3.88u+0.13s         32.86u+1.34s        13.19u+0.97s
             2.90 Mb (2.16X)     1.11 Mb (2.23X)     7.65 Mb (6.07X)     3.54 Mb (7.07X)
Cobbs’       108.70u+532.81s     30.50u+76.06s       n/a                 n/a
             87.99 Mb (65.67X)   32.93 Mb (65.85X)
Dfs          30.89u+104.17s      6.48u+0.42s         28.46u+76.86s       6.43u+0.61s
             52.25 Mb (38.99X)   19.55 Mb (39.10X)   44.66 Mb (35.45X)   17.66 Mb (35.32X)

Table 2. Times (in seconds) to build the indices and their space overhead. The
time is separated in the CPU part (“u”) and the I/O part (“s”). The space is
expressed in megabytes, and also the ratio index/text is shown in the format
rX, meaning that the index takes r times the text size.



- Our strategy Dfs(a) of using a simpler traversal algorithm on the suffix tree
  and in return using a faster search algorithm definitely pays off, since our
  implementation is 3 to 40 times faster than Cobbs’, and it is the fastest
  choice for small m and k values. Independently of this fact, the suffix tree
  indices improve on larger alphabets, but they are much more sensitive to the
  growth of m or k. In fact, the differences between FRA and DNA are due to the
  different values of m used. The big problem with this type of index is of
  course the huge space requirements it poses.

- Myers’ index behaves better on short patterns, when less splitting is
  necessary. It works well for DNA but it worsens on English text. We
  conjecture that the non-randomness may play a role here: the index takes
  internally $q = \log_\sigma n$ to avoid searching a number of nonexistent
  samples that are at distance k or less from the pattern (in our case it took
  q = 10 for DNA and q = 4 for English). However, in biased texts like English,
  a lot of q-grams are not present anyway, and the index pays to search all of
  them. For DNA the index is a good alternative, since although it is up to 13
  times slower than Dfs(a), it takes 4 times less space. It is also better than
  the Samples index when the pattern is short, but not when pattern
  partitioning is necessary.

- The Samples index reaches its optimum performance for q between 5 and 6,
  depending on the case. Unlike Myers’, this index works better on English text
  than on DNA. In DNA it produces a small index (4 times smaller than Myers’)
  but in general has worse search time. The index for q = 5 on English text is
  half the size of Myers’ index, and it also obtains good results for medium
  patterns and low error levels.

- Dfs(p), which works on the same data structure as Dfs(a), improves over it
  when the patterns are not very short and the error level is not too high.
  When applicable, its query time is by far the lowest among all the indices.

Index        k   DNA           H-DNA         FRA           H-FRA
                 (m = 10)      (m = 10)      (m = 8)       (m = 8)
On-line      1   131.0/21.35   55.01/15.24   59.74/17.31   29.99/9.00
             2   152.6/20.56   62.41/15.48   114.8/20.86   52.77/11.56
             3   188.7/20.36   84.20/15.33   142.2/20.56   60.30/13.76
Myers’       1   0.29/1.74     0.64/2.15     7.04/8.04     6.17/7.29
             2   0.97/2.18     1.53/2.74     23.5/21.4     20.2/18.2
             3   6.29/6.79     8.17/8.10     22.4/20.8     20.9/18.5
Samples(4)   1   1.80/6.66     1.72/5.48     0.75/2.01     0.75/1.76
             2   9.33/26.7     9.10/23.4     3.30/9.54     2.69/2.68
             3   30.7/93.4     25.5/73.6     13.8/30.3     13.6/26.7
Samples(5)   1   0.91/2.81     0.93/2.38     0.75/1.91     0.77/1.74
             2   9.88/27.7     9.35/23.5     4.92/10.4     3.47/7.07
             3   36.4/97.2     30.9/77.3     23.9/38.9     21.5/33.5
Samples(6)   1   0.90/2.71     0.93/2.35     0.89/2.06     0.86/1.82
             2   11.3/29.4     10.9/24.6     6.81/12.5     4.81/8.99
             3   57.3/119      49.0/92.5     39.3/52.8     38.9/47.7
Cobbs’       1   0.83/1.98     1.85/3.67     n/a           n/a
             2   3.85/14.9     6.04/19.1     n/a           n/a
             3   17.9/84.5     21.8/79.3     n/a           n/a
Dfs(a)       1   0.05/0.15     0.05/0.04     0.10/0.25     0.09/0.07
             2   0.88/2.72     0.71/0.57     0.37/0.96     0.27/0.22
             3   5.53/16.9     4.69/3.96     1.51/4.39     1.01/0.82

Table 3. Query time for short patterns and for 1, 2 and 3 errors. The on-line
algorithm shows time in milliseconds in the format “user/system”, in italics.
The indexed algorithms show the fraction they take of the time of the on-line
algorithm. The format is “a/b”, where a considers only user time and b
considers both. The fastest indexed times are in boldface.



5 Conclusions and Future Work 

We have presented a new indexing scheme for approximate string matching. The
main idea is to split the pattern into pieces to be searched with fewer errors,
and to use a suffix tree to find their approximate matches in the text. Later,
we verify all their matches for an occurrence of the complete pattern. The
splitting technique balances between traversing too many nodes of the suffix
tree and verifying too many text positions. We have also shown how to traverse
the suffix tree efficiently in practice. We have proved analytically that the
resulting index has sublinear retrieval time (of the form $O(n^{\lambda})$,
where $0 < \lambda < 1$ if the error level is moderate). Finally, we have
presented the first (as far as we know) experimental results that compare the
different implemented indexing schemes, which show that the proposed idea
improves over all the previously implemented approaches.

A remaining problem is that the suffix tree data structure needs too much
space. We plan to replace it by a suffix array. The suffix array contains the
leaves of the suffix tree in left-to-right order, or equivalently the pointers
to all the text suffixes in lexicographical order. The space requirement is in
practice 4 times the text size, which is reasonable. Suffix tree nodes (i.e.
subtrees) correspond to suffix array intervals. Any movement in the suffix tree
can be simulated in $O(\log n)$ time in the suffix array, and therefore the
final complexity is multiplied by $O(\log n)$ and the condition for time
sublinearity is not affected. Finally, we are still free to use the j value we
like (unlike q-gram indices, which are limited by q). In particular, we can
easily implement specialized pattern partitioning approaches for biased texts,
where the partitioning minimizes the total number of text positions to verify.
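
The correspondence between suffix tree nodes and suffix array intervals can be
illustrated with a small sketch. It is not the implementation planned by the
authors: the function names are hypothetical, the suffix array is built naively,
and only the descent from the root along an exact string is shown. Each step of
a tree traversal would narrow such an interval with binary searches, which is
where the $O(\log n)$ factor comes from.

```python
def build_suffix_array(text):
    """Starting positions of all suffixes of `text` in lexicographic order.

    Naive construction by sorting the suffixes themselves; adequate only
    for small illustrative inputs.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def interval_for(text, sa, prefix):
    """Half-open interval [lo, hi) of `sa` whose suffixes start with `prefix`.

    The interval plays the role of the suffix tree node reached by spelling
    `prefix` from the root. It is located with two binary searches.
    """
    m = len(prefix)

    def first_index(strictly_greater):
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            head = text[sa[mid]:sa[mid] + m]  # first m characters of the suffix
            if head < prefix or (strictly_greater and head == prefix):
                lo = mid + 1
            else:
                hi = mid
        return lo

    return first_index(False), first_index(True)


if __name__ == "__main__":
    text = "banana"
    sa = build_suffix_array(text)
    lo, hi = interval_for(text, sa, "an")
    # The text positions where "an" occurs are the entries sa[lo:hi].
    print(sa, (lo, hi), sorted(sa[lo:hi]))
```

An empty interval (lo equal to hi) corresponds to a string that labels no path
in the suffix tree.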



Index         k   DNA           H-DNA         FRA           H-FRA
                  (m = 20)      (m = 20)      (m = 16)      (m = 16)

On-line       2   184.6/22.18   75.16/16.61   60.59/17.56   29.91/9.48
              4   311.4/21.70   116.0/15.79   116.3/20.83   50.71/14.98
              6   779.2/21.42   297.4/15.77   205.6/20.58   92.36/13.37

Myers'        2   0.67/1.69     0.91/1.97     7.03/8.06     10.9/10.9
              4   5.13/5.50     5.61/5.74     32.7/29.2     31.9/26.3
              6   16.9/16.8     17.7/17.3     26.5/25.0     25.2/23.1

Samples (4)   2   1.55/5.10     1.60/4.55     0.44/0.95     0.63/1.03
              4   6.14/13.4     6.16/12.4     2.08/4.62     2.03/4.01
              6   9.10/25.4     9.48/27.9     9.59/18.6     8.85/16.1

Samples (5)   2   0.60/1.93     0.64/1.73     0.38/0.75     0.62/0.91
              4   5.26/11.3     5.77/11.9     2.21/4.87     2.15/4.19
              6   10.0/25.5     10.7/26.4     14.8/23.6     12.7/19.6

Samples (6)   2   0.31/0.83     0.41/0.84     0.39/0.70     0.60/0.91
              4   5.61/11.7     6.18/12.1     2.71/5.13     2.51/4.42
              6   15.2/31.5     15.3/30.8     22.9/31.1     19.3/25.3

Cobbs'        2   3.93/11.7     6.60/16.0     n/a           n/a
              4   ***           69.5/171      n/a           n/a
              6   ***           ***           n/a           n/a

Dfs(a)        2   0.31/1.19     0.32/0.26     0.59/1.49     0.45/0.34
              4   6.39/30.8     4.31/3.79     4.15/14.4     2.49/1.93
              6   14.6/64.9     7.76/7.37     10.7/42.0     5.87/5.13

Dfs(p)        2   0.09/0.23     0.08/0.07     0.22/0.43     0.18/0.13
              4   3.24/6.42     2.63/2.32     0.76/1.92     0.61/0.47
              6   11.3/12.6     7.76/7.37     2.26/6.05     1.59/1.38

*** One single query took more than 2 hours of elapsed time.

Table 4. Query time for medium patterns and for k = 2, 4 and 6. The on-line
algorithm shows time in milliseconds in the format "user/system", in italics.
The indexed algorithms show the fraction they take of the time of the on-line
algorithm. The format is "a/b", where a considers only user time and b
considers both. The fastest indexed times are in boldface.

Appendix: Probability of Reaching a Suffix Tree Node

We need to determine the probability of the automaton being active at
a given node of depth $\ell$ in the suffix tree. Notice that the automaton is
active if and only if some state of the last row is active (recall the figure
of the automaton). This is equivalent to some prefix of the pattern matching,
with k errors or less, the text substring represented by the suffix tree node
under consideration.

We are therefore interested in the probability of a pattern prefix of length
$m'$ matching a text substring of length $\ell$. This analysis extends an
earlier one. As Figure 7 illustrates, at least $\ell - k$ text characters must
match the pattern when $\ell \ge m'$, and at least $m' - k$ pattern characters
must match the text whenever $m' \ge \ell$. Hence, the probability of matching
is upper bounded by

$$\frac{1}{\sigma^{\ell-k}} \binom{m'}{\ell-k} \binom{\ell}{\ell-k}
\qquad \text{or} \qquad
\frac{1}{\sigma^{m'-k}} \binom{m'}{m'-k} \binom{\ell}{m'-k}$$

depending on whether $\ell \ge m'$ or $m' \ge \ell$, respectively (the
combinatorials count all the possible locations for the matching characters in
both strings). Notice that this imposes that $m' - k \le \ell \le m' + k$. We
also assume $m' > k$, since otherwise the matching probability is 1. Since
$k < m' \le m$, we have that $\ell \le m + k$, otherwise the matching
probability is zero. Hence the matching probability is 1 for $\ell \le k$ and 0
for $\ell > m + k$, and we are interested in what happens in between.



[Figure 7: a text substring aligned with a pattern prefix, for m' = 9 and
k = 5; at least 9 - 5 = 4 characters must match.]

Fig. 7. Upper bound for the probability of matching. At least
$\max(m' - k, \ell - k)$ characters must match, since otherwise it would not be
possible to convert one string into the other.



Since we are interested in any pattern prefix matching the current text
substring, we add up all the possible lengths from k to m:

$$\sum_{m'=k}^{\ell} \frac{1}{\sigma^{\ell-k}} \binom{m'}{\ell-k} \binom{\ell}{\ell-k}
\;+\; \sum_{m'=\ell+1}^{m} \frac{1}{\sigma^{m'-k}} \binom{m'}{m'-k} \binom{\ell}{m'-k}$$

In the analysis that follows, we call $\beta = k/\ell$, where
$\alpha/(1+\alpha) \le \beta < 1$. We will prove that, after some depth $\ell$
in the suffix tree, the matching probability is $O(\gamma(\beta)^{\ell})$, for
some $\gamma(\beta) < 1$. We begin with the first summation. We analyze its
largest term (the last one, $m' = \ell$), which is

$$\frac{1}{\sigma^{\ell-k}} \binom{\ell}{k}^{2}$$

and by using Stirling's approximation
$x! = (x/e)^x \sqrt{2\pi x}\,(1 + O(1/x))$ we have

$$\frac{1}{\sigma^{\ell-k}}
\left( \frac{\ell^{\ell} \sqrt{2\pi\ell}}
{k^{k} (\ell-k)^{\ell-k} \sqrt{2\pi k} \sqrt{2\pi(\ell-k)}} \right)^{2}
\left( 1 + O\!\left(\frac{1}{\ell}\right) \right)$$

which is

$$\left( \frac{1}{\sigma^{1-\beta} \beta^{2\beta} (1-\beta)^{2(1-\beta)}} \right)^{\ell}
\ell^{-1} \left( \frac{1}{2\pi\beta(1-\beta)} + O\!\left(\frac{1}{\ell}\right) \right)$$

This formula is of the form $\gamma(\beta)^{\ell}\, O(1/\ell)$, where we define

$$\gamma(x) = \frac{1}{\sigma^{1-x} x^{2x} (1-x)^{2(1-x)}} \qquad (1)$$

The whole first summation is bounded by $\ell - k$ times the last term, which
gives $(\ell - k)\, \gamma(\beta)^{\ell}\, O(1/\ell) = O(\gamma(\beta)^{\ell})$.
Therefore the first summation is exponentially decreasing with $\ell$ if and
only if $\gamma(\beta) < 1$, i.e.

$$\sigma > \left( \frac{1}{\beta^{2\beta} (1-\beta)^{2(1-\beta)}} \right)^{\frac{1}{1-\beta}}
= \frac{1}{\beta^{\frac{2\beta}{1-\beta}} (1-\beta)^{2}} \qquad (2)$$

It is easy to show analytically that $e^{-1} < \beta^{\beta/(1-\beta)} < 1$ if
$0 < \beta < 1$, so it suffices that $\sigma > e^{2}/(1-\beta)^{2}$, or
equivalently

$$\beta < 1 - \frac{e}{\sqrt{\sigma}} \qquad (3)$$

is a sufficient condition for the largest (last) term to be
$O(\gamma(\beta)^{\ell})$, as well as the whole first summation.

We address now the second summation, which is more complicated. In this
case, it is not clear which is the largest term. We can see each term as

$$\frac{1}{\sigma^{r}} \binom{\ell}{r} \binom{k+r}{k}$$

where $\ell - k < r \le m - k$. By considering $r = x\ell$
($x \in [1-\beta,\, m/\ell - \beta]$) and applying again Stirling's
approximation, we maximize the base of the resulting exponential, which is

$$h(x) = \frac{(x+\beta)^{x+\beta}}{\sigma^{x} x^{2x} (1-x)^{1-x} \beta^{\beta}}$$

Elementary calculus leads to solving a second-degree equation that has roots
in the interval $[1-\beta, \infty)$ only if $\sigma \le \beta/(1-\beta)^{2}$.
Since due to Eq. (3) we are only interested in $\sigma > 1/(1-\beta)^{2}$,
$\partial h(x)/\partial x$ does not have roots, and the maximum of $h(x)$ is at
$x = 1 - \beta$. That means $r = \ell - k$, i.e. the first term of the second
summation, which is the same as the largest term of the first summation.

We conclude that the probability of being active at a node of level $\ell$ is
upper bounded by $O(\gamma(\beta)^{\ell})$, and therefore the condition of
Eq. (3) is valid for the whole summation. When $\gamma(\beta)$ is 1, the
probability is very high: only considering the term $m' = \ell$ we have
$\Omega(1/\ell)$.

Hence, the result is that the matching probability is very high for
$\beta = k/\ell \ge 1 - e/\sqrt{\sigma}$, and otherwise it is
$O(\gamma(\beta)^{\ell})$, where $\gamma(\beta) < 1$.

Although the e appeared via a bounding condition, we can see that this
bound is tight: we take $\log_\sigma$ on both sides of the condition
$\gamma(\beta) < 1$ and get

$$1 - \beta + 2\left(\beta \log_\sigma \beta + (1-\beta) \log_\sigma (1-\beta)\right) > 0$$

and by replacing $x = 1 - \beta$ and using $\ln(1-x) = -x + O(x^{2})$ we have

$$x \ln\sigma + 2\left(x \ln x - (1-x)(x + O(x^{2}))\right)
= x \ln\sigma + 2x \ln x - 2x + O(x^{2}) > 0$$

from where we divide by x to obtain

$$x > \frac{e}{\sqrt{\sigma}} (1 + O(x)) = \frac{e}{\sqrt{\sigma}} \left(1 + O(1/\sqrt{\sigma})\right)$$

We conclude that the precise limit for $\beta = 1 - x$ is

$$\beta < 1 - \frac{e}{\sqrt{\sigma}} + O(1/\sigma)$$

As we show experimentally elsewhere, however, the real $\beta$ limit is very
close to the same formula if e is replaced by c = 1.09. The reason is that the
bounding condition (Figure 7) we use is not strong enough: for instance, we
could avoid replacements in the edit distance and the bound would be the same.
In the paper we use a limit of the form $\beta = 1 - c/\sqrt{\sigma}$, knowing
that we can prove $c \le e$ but in practice it holds $c \approx 1$.
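
The relation between the sufficient condition of Eq. (3) and the exact
condition $\gamma(\beta) < 1$ can be checked numerically. The following is only
a sketch: the function names are ours, the alphabet sizes are arbitrary, and
the bisection relies on the observation (ours, obtained by differentiating
$\ln\gamma$) that $\gamma$ increases up to $\beta = \sqrt{\sigma}/(1+\sqrt{\sigma})$,
where it exceeds 1.

```python
import math


def gamma(beta, sigma):
    """gamma(beta) of Eq. (1) for alphabet size sigma, with 0 < beta < 1."""
    return 1.0 / (sigma ** (1 - beta)
                  * beta ** (2 * beta)
                  * (1 - beta) ** (2 * (1 - beta)))


def sufficient_limit(sigma):
    """The bound 1 - e/sqrt(sigma) of Eq. (3)."""
    return 1.0 - math.e / math.sqrt(sigma)


def exact_limit(sigma, iterations=200):
    """Approximate the largest beta with gamma(beta) < 1, by bisection.

    Assumption of this sketch: gamma is below 1 near beta = 0 and is
    increasing up to beta = sqrt(sigma) / (1 + sqrt(sigma)), where it is
    above 1, so there is a single crossing in between.
    """
    lo = 1e-9
    hi = math.sqrt(sigma) / (1 + math.sqrt(sigma))
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if gamma(mid, sigma) < 1:
            lo = mid
        else:
            hi = mid
    return lo


if __name__ == "__main__":
    for sigma in (16, 32, 64, 256):
        print(sigma, sufficient_limit(sigma), exact_limit(sigma))
```

For each alphabet size the sufficient limit lies below the exact crossing
point, as the derivation predicts.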
Abstract. We investigate how the size of the compressed version of a
2-dimensional image changes when we cut off a part of it, e.g. extracting
a photo of one person from a photo of a group of people. 2-dimensional
compression is considered in terms of finite automata. Let n be the size
of the smallest acyclic automaton which describes an image T. We show
that the tight bound for the compression size of a subsegment (subimage)
in the deterministic case is $\Theta(n^{2.5})$ and in the weighted case is
$O(n)$. We also show how to construct efficiently the compressed
representation of subsegments given the compressed representation of the
whole image. Two applications of subsegment compression are more
efficient automata-compressed pattern-matching and the first polynomial
time algorithm for the fully compressed pattern-checking problem for
weighted automata.



1 Introduction 

The compression size of images is of crucial importance in multimedia systems
and in transferring large images over the WWW. Deterministic and weighted
finite automata are successful tools for compressing 2-dimensional images, and
there are several software packages using this type of compression. Finite
automata can describe quite complicated images: for example, deterministic
automata can describe Hilbert's curve with a given resolution, while weighted
automata can describe even much more complicated curves. The objects considered
are potentially exponentially compressed, so algorithms which apply
decompression are theoretically not polynomial time algorithms. In practice
exponential compression does not usually appear; nevertheless the compression
ratio for two-dimensional images can be very high, especially compared with the
one-dimensional case (for example for images corresponding to fractals having a
short description). For one-dimensional words there exist polynomial-time
deterministic algorithms for compressed and fully compressed pattern-matching,
despite the fact that the uncompressed size of objects could be exponential.
However, these problems become much harder in the two-dimensional case. Our
main result is a constructive proof of the fact that the compressed size of
subsegments of automata-compressed images grows only polynomially. This
contrasts with the exponential growth of the compression size of subimages for
compression in terms of recursive descriptions. Our alphabet is
$\Sigma = \{0, 1, 2, 3\}$, the elements of which correspond to the four
quadrants of a square array, see Figure 1.

    1 | 3
    --+--
    0 | 2

Fig. 1. Enumeration of the quadrants.



A word w of length k over $\Sigma$ can be interpreted, in a natural way, as a
unique address of a pixel x of a $2^k \times 2^k$ image (array); we write
address(x) = w. The length k is called the resolution of the image. For a
language $L \subseteq \Sigma^{+}$ denote by $Image_k(L)$ the $2^k \times 2^k$
black-and-white image such that the color of a given pixel x is black iff
$address(x) \in L$. We also consider weighted languages; formally they
correspond to functions which associate with each word w a value $weight_L(w)$.
A weighted language L over $\Sigma$ and a resolution k determine the gray-tone
image $Image_k(L)$ such that the color of a given pixel x equals
$weight_L(address(x))$. If all words in L are of the same length k then we can
omit the subscript k and write $Image(L)$. Our description of the language is
in terms of a finite (unweighted or weighted) automaton A. We define
$Image_k(A) = Image_k(L(A))$, where $L(A)$ is the language accepted by A.
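
The addressing scheme can be made concrete with a short sketch. The function
names below are hypothetical, and the coordinate convention (column counted from
the left, row counted from the bottom) is our assumption, chosen to agree with
the quadrant enumeration of Fig. 1.

```python
from itertools import product


def address_to_pixel(word):
    """Map an address over {0,1,2,3} to (column, row), with row 0 at the bottom.

    Quadrants follow Fig. 1: 0 = lower left, 1 = upper left,
    2 = lower right, 3 = upper right.
    """
    col = row = 0
    for symbol in word:
        d = int(symbol)
        col = 2 * col + (d >> 1)  # symbols 2 and 3 select the right half
        row = 2 * row + (d & 1)   # symbols 1 and 3 select the upper half
    return col, row


def image_k(language, k):
    """Rows (top row first) of the 2^k x 2^k image; 1 = black, 0 = white."""
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    for word in language:
        if len(word) == k:
            col, row = address_to_pixel(word)
            grid[size - 1 - row][col] = 1
    return grid


def words_over(symbols, k):
    """All words of length k over the given symbols, e.g. {0,1,2}^k."""
    return {"".join(p) for p in product(symbols, repeat=k)}


if __name__ == "__main__":
    for line in image_k(words_over("012", 3), 3):
        print("".join("#" if cell else "." for cell in line))
```

With the language {0,1,2}^k the sketch draws an image in which the upper right
quadrant is blank at every level of the decomposition.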

Representation in terms of acyclic deterministic automata is equivalent to a
representation by a 2-dimensional grammar; each production corresponds to the
way of decomposing a square into 4 smaller subsquares of the same shape. For
the automaton from Figure 2 we can define the subsquare corresponding to state
$s_1$ by ($\emptyset$ denotes a blank subsquare of appropriate shape):

$$s_1 = \begin{pmatrix} s_2 & \emptyset \\ s_2 & s_2 \end{pmatrix}$$

We consider a subsegment image P and the host image T described by
automata of sizes m and n. Denote by Compress(P) and Compress(T) the
automata describing P and T, respectively.




188 Juhani Karhumaki, Wojciech Plandowski, and Wojciech Rytter 



Our main problem is the Subsegment Compression Problem:

Instance: the Compress(T) representation of a $2^k \times 2^k$ square image, a
point x in T and an integer $k' \le k$. Let R be a square
$2^{k'} \times 2^{k'}$ subsegment of T whose left-upper corner is positioned at
x.

Question: What is the size of Compress(R)? What is the complexity of
computing Compress(R)?

Another problem is pattern-checking, which consists in testing for an
occurrence of a pattern at a fixed position of a large compressed image. It is
co-NP complete for 2-dimensional compression in terms of recursive generations.
We define the depth of the automaton as the length of the longest path from the
initial state to an accepting state. An acyclic automaton can be transformed to
an equivalent automaton in which for each state q each path from the initial
state to q has the same length. We say that a state q belongs to level t if all
paths from the initial state to q are of length t. Here, the length of a path
is the number of edges in the path. In our considerations we may restrict
ourselves to acyclic automata since we consider only finite resolution images
or, more precisely, finite resolution approximations of infinite resolution
images.

Example. $Image(\{0, 1, 2\}^{k}) = S_k$ is the $2^k \times 2^k$ black-and-white
square part of Sierpinski's triangle, see Figure 2 for the case k = 4. The
corresponding smallest acyclic deterministic automaton accepting all paths
describing black pixels has 5 states.



[Figure 2: the image $S_4$ with the subsegment R to be cut off marked, and its
automaton with starting state s0, intermediate states s1, s2, s3 and accepting
state s4; each state is connected to the next by edges labelled 0, 1, 2.]

Fig. 2. The image $S_4$ and its smallest acyclic automaton. Edges which are not
on accepting paths are disregarded.
2 The Subsegment Compression Problem for
Deterministic Automata

Let $A = (\{0,1,2,3\}, Q, q_0, \delta)$ be a deterministic acyclic automaton of
depth n, where Q is a set of states, $q_0 \in Q$ is the initial state, and
$\delta : Q \times \Sigma^{*} \to Q$ is a partial transition function. The
automaton A defines the language $L(A) = \{w : \delta(q_0, w)$ is defined,
$|w| = n$ and $\delta(q_0, w)$ is accepting$\}$. Note that in the definition
of an automaton we do not need to specify the single accepting state. For
$q \in Q$, denote by Image(q) the image which is generated by the automaton
which is obtained from A by changing its initial state to q. Clearly,
$Image(q_0) = Image(A)$.

A regular block of a $2^k \times 2^k$ image T is defined as follows. T is a
regular block, and if R is a regular block then all its quadrants are also
regular blocks. A square subsegment of the shape $2^t \times 2^t$ is said to be
of rank t. Denote by $\emptyset$ a square blank block consisting only of white
pixels. We use the same notation for all possible sizes of $\emptyset$; the
size depends on the context. We can interpret the state q as a name of a
regular block X = Image(q), and we write name(X) = q. If X is a blank block
then we write $name(X) = \emptyset$. We call states essential iff they are on a
path to an accepting state.

Lemma 1. Assume the whole image is not totally blank. The number of essential
states of the smallest acyclic deterministic automaton describing T
equals the number of different nonblank regular blocks of T.
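
Lemma 1 suggests a direct way to count the essential states of the smallest
automaton from an uncompressed image: enumerate the quadrants recursively and
count the distinct nonblank blocks. The sketch below does this for the image
$S_4$ of the example; the function names are hypothetical and the pixel test
`col & row == 0` is our own shortcut for "no symbol 3 in the address".

```python
def sierpinski_grid(k):
    """Rows (top row first) of S_k = Image({0,1,2}^k); 1 = black, 0 = white.

    A pixel is black iff its address contains no symbol 3, i.e. iff column
    and row (row 0 at the bottom) share no common 1 bit.
    """
    size = 2 ** k
    return [[1 if (col & (size - 1 - r)) == 0 else 0 for col in range(size)]
            for r in range(size)]


def count_nonblank_regular_blocks(grid):
    """Number of distinct nonblank regular blocks of a 2^k x 2^k 0/1 image."""
    seen = set()

    def visit(top, left, size):
        block = tuple(tuple(grid[r][left:left + size])
                      for r in range(top, top + size))
        if not any(any(row) for row in block):
            return  # blank blocks correspond to no essential state
        if block in seen:
            return  # this block (and all its quadrants) was already counted
        seen.add(block)
        if size > 1:
            half = size // 2
            for dt in (0, half):
                for dl in (0, half):
                    visit(top + dt, left + dl, half)

    visit(0, 0, len(grid))
    return len(seen)


if __name__ == "__main__":
    print(count_nonblank_regular_blocks(sierpinski_grid(4)))
```

For $S_4$ the count agrees with the 5 states mentioned in the example: one
nonblank block for each rank from 4 down to 0.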



[Figure 3: the subsegment R and its automaton.]

Fig. 3. The subsegment R of $S_4$ and the smallest acyclic automaton describing
R.



We illustrate the lemma with the following example. The states at depth 2
of the automaton from Figure 3 correspond to the blocks of the subsegment R
in the way shown in Figure 4.

The crucial notion is that of a pseudo-regular block, defined as a square
subsegment of rank t + 1 of the corresponding image consisting of 4 adjacent
regular blocks of rank t; the regular blocks themselves are also considered as
pseudo-regular blocks. Denote by $Pseudo\_Reg_t(T)$ the set of pseudo-regular
blocks of rank t of T. For a subblock Y of a block X define the position of Y
in X ($pos_X(Y)$) as the position of the left-upper corner of Y inside X. For a
block X of rank t + 1 and a vector a define $Sub_t(a, X)$ to be the subblock Y
of X such that $pos_X(Y) = a$. Each pseudo-regular block X is identified by a
composite name $(name(X_1), name(X_2), name(X_3), name(X_4))$ where
$X_1, \ldots, X_4$ are regular blocks which are quadrants of X, listed in the
order corresponding to Figure 1. The compressed size of the subsegment can grow
since the number of pseudo-regular blocks could be much larger than the number
of regular blocks in the same image. For example there are 2 regular blocks of
rank 1 in the image T of our example but 5 different pseudo-regular blocks of
rank 1 (including blank ones).

Fig. 4. Illustration of Lemma 2. Regular blocks of R (in bold) are parts of
pseudo-blocks of T. There are 3 nonblank regular blocks of rank 1 in segment R
from Figure 3; they correspond to the states of the corresponding automaton for
R. Each of these blocks is a part of a pseudo-regular block of rank 2 of T with
the same vector a shown in the figure. We have
$q5 = Sub(a, s3, s3, \emptyset, s3)$, $q6 = Sub(a, s3, s3, s3, \emptyset)$,
$q7 = Sub(a, \emptyset, \emptyset, s3, s3)$.

Lemma 2.

For a given rank t there is a vector $a_t$ such that each regular block of rank
t in the subimage R equals $Sub(a_t, (A, B, C, D))$ where (A, B, C, D) is a
pseudo-regular block of T of rank t + 1.

Theorem 1. Assume the compression is in terms of deterministic automata.
The compressed representation of a square subsegment R of T can be computed
in $O(|Compress(T)|^{2.5})$ time.

Proof.

The states of the automaton A' for the subsegment are tuples
$(a_k, (A, B, C, D))$ where (A, B, C, D) are pseudo-regular blocks of T of rank
k + 1, and $a_k$ is a vector from Lemma 2. Due to Theorem 2 the number of
pseudo-regular blocks is $O(n^{2.5})$, where n = |Compress(T)|.

We compute names of pseudo-regular blocks top down. However, we consider
only those pseudo-regular blocks which contain a nonblank block of the
subsegment. If we know the names of pseudo-regular blocks of rank t + 1 then
each pseudo-regular block of rank t consists of 4 subblocks of a single
pseudo-regular block of rank t + 1; the details will be given in the full
version.

3 Tight Bounds for the Compression Size of Subimages 

We need the following technical lemmas. 

Lemma 3. Let $k_1, k_2, \ldots, k_r \ge 0$. Then

$$\sum_{i=2}^{r} \min\left(k_{i-1}^{4},\; k_i^{2} + k_{i+1}^{2} + \ldots + k_r^{2}\right)
\le (k_1 + \ldots + k_r)^{2.5}.$$

Proof. We omit the technical proof (induction on r).

Lemma 4. Each pseudo-regular block of rank i is a central block of a regular
block of rank k, or a central block of two adjacent (horizontally or vertically)
regular blocks of rank k, where $k \ge i$.

Theorem 2. For each subimage $\mathcal{R}$ of an image T described by a
deterministic automaton of size n there is a deterministic automaton describing
$\mathcal{R}$ of size $O(n^{2.5})$.

Proof. By the construction in the proof of Theorem 1 it is enough to give an
upper bound for the number of all pseudo-regular subsquares of T.

Let $ps_i$ be the number of pseudo-regular blocks of rank i for
$1 \le i \le r$ and $k_i$ be the number of regular blocks of rank i for
$0 \le i \le r$. Then due to Lemma 4 the number of pseudo-regular blocks can be
bounded by the number of pairs of regular blocks of rank at least i, hence we
have

$$ps_i = O\left(k_i^{2} + k_{i+1}^{2} + \ldots + k_r^{2}\right)$$

At the same time each pseudo-regular block of rank i is composed of 4 regular
blocks of rank i - 1, hence $ps_i \le k_{i-1}^{4}$. Therefore

$$ps_i = O\left(\min\left(k_{i-1}^{4},\; k_i^{2} + \ldots + k_r^{2}\right)\right)$$

The conclusion of the theorem is now a consequence of Lemma 3.

In the proof of the lower bound we need two generations of size $O(n)$ which
composed together give an object whose compression size is $\Omega(n^{2})$. We
use the following fact. Assume here that the alphabet consists of all integers
and we have the morphisms

$$h(i) = 2i \;\; 2i \;\; 2i{+}1 \;\; 2i{+}1; \qquad
g(i) = 2i \;\; 2i{+}1 \;\; 2i \;\; 2i{+}1.$$

We have

$$h^{2}(0) = 0011001122332233;$$

$$g^{2}(0) = 0101232301012323.$$

Lemma 5.

Let $u = h^{k}(0)$ and $w = g^{k}(0)$. Then all pairs $(u_i, w_i)$ are
different for $1 \le i \le n^{2}$, where $n = 2^{k}$.
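
Lemma 5 is easy to check mechanically for small k. The sketch below iterates
the two morphisms on lists of integers (so that symbols larger than 9 cause no
ambiguity) and tests that the position-wise pairs are all different; the helper
names are ours.

```python
def h(i):
    """The morphism h(i) = 2i 2i 2i+1 2i+1."""
    return [2 * i, 2 * i, 2 * i + 1, 2 * i + 1]


def g(i):
    """The morphism g(i) = 2i 2i+1 2i 2i+1."""
    return [2 * i, 2 * i + 1, 2 * i, 2 * i + 1]


def iterate_morphism(morphism, k, start=0):
    """Apply `morphism` k times, starting from the one-symbol word `start`."""
    word = [start]
    for _ in range(k):
        word = [s for symbol in word for s in morphism(symbol)]
    return word


def pairs_all_different(k):
    """Check that the pairs (u_i, w_i) of h^k(0) and g^k(0) are all distinct."""
    u = iterate_morphism(h, k)
    w = iterate_morphism(g, k)
    pairs = list(zip(u, w))
    return len(pairs) == 4 ** k and len(set(pairs)) == len(pairs)


if __name__ == "__main__":
    print("".join(map(str, iterate_morphism(h, 2))))
    print("".join(map(str, iterate_morphism(g, 2))))
    print(all(pairs_all_different(k) for k in range(1, 6)))
```

Both words have length $4^{k} = n^{2}$ over the n symbols 0, ..., n - 1, so the
pairs exhaust all $n^{2}$ combinations of symbols.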




[Figure 5, panels (A) and (B): the subsegment R and its middle point, regular
subsquares with different small subsquares, and a dotted subsquare.]

Fig. 5. The structure of the image $T = I_k$ and the subsegment R (indicated
in bold).



The structure of an image whose compression size is n and whose subsegment has
compression size $\Omega(n^{2.5})$ is illustrated in Figure 5. Lemma 5 is used
to generate n different subsquares on one side of the middle line and n
different subsquares on its other side in such a way that we have
$\Omega(n^{2})$ different pairs of these subsquares. We can "pump" these
subsquares in such a way that we obtain many different larger subsquares which
are blank except for small shaded subsquares. There are $\Omega(n^{2})$
different (shaded) small subsquares of T, each one consisting of 4 regular
subsquares touching the middle line. The distance between consecutive small
subsquares is 2d, where $d = 2^{\sqrt{n}}$. There are $\Omega(n^{2})$
$d \times d$ regular subsquares $U_1, U_2, \ldots$ in the subsegment (touching
the middle line). Each small shaded corner subsquare of $U_i$ is different
(there are $\Omega(n^{2})$ of them), so there are together
$\Omega(n^{2} \cdot \log(|U_i|)) = \Omega(n^{2.5})$ different regular
subsquares (dotted subsquares in the figure). Hence the compressed size of the
subsegment should be $\Omega(n^{2.5})$.

Theorem 3 (lower bound).

There is an infinite sequence of square images T described by deterministic
automata such that |Compress(T)| = n and there is a square subsegment R of T
satisfying $|Compress(R)| = \Omega(n^{2.5})$.
4 The Subsegment Compression Problem for Weighted
Automata

For weighted automata the compression size of the subsegment grows only
linearly. This surprising phenomenon is due to the fact that for weighted
automata we can have many edges from the same state labelled with the same
symbol, but having possibly different weights. This enables operations similar
to matrix addition; such a trick is not possible in the deterministic case. A
weighted finite automaton describing an image is specified by: a set of states
Q, the alphabet {0, 1, 2, 3}, weights of edges given by the functions
$W_a : Q \times Q \to (-\infty, \infty)$ for edges labeled by the symbol a, for
$a \in \{0, 1, 2, 3\}$, a function $I : Q \to (-\infty, \infty)$ called the
initial distribution function and a function $F : Q \to (-\infty, \infty)$
called the final distribution function.

The weight of a word $w = a_1 a_2 \cdots a_k$ is interpreted to give a color
$W(w)$ for the pixel with address w. It is defined as
$W(w) = I\, W_{a_1} W_{a_2} \cdots W_{a_k} F$.
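
The definition of $W(w)$ is a product of a row vector, a sequence of matrices
and a column vector, which the following sketch evaluates from left to right.
The function name and the two-state automaton in the demonstration are
hypothetical and serve only as an illustration; they are not taken from the
paper.

```python
def word_weight(initial, matrices, final, word):
    """W(w) = I * W_{a_1} * ... * W_{a_k} * F for a word over {0,1,2,3}.

    `initial` and `final` are lists indexed by state; `matrices[a]` is the
    |Q| x |Q| weight matrix of the symbol a, as a list of rows.
    """
    vec = list(initial)  # current row vector, starts as I
    for symbol in word:
        m = matrices[int(symbol)]
        vec = [sum(vec[p] * m[p][q] for p in range(len(vec)))
               for q in range(len(m[0]))]
    return sum(v * f for v, f in zip(vec, final))


if __name__ == "__main__":
    # A made-up two-state weighted automaton (illustrative values only).
    I = [1.0, 0.0]
    F = [0.5, 1.0]
    W_even = [[0.5, 0.0], [0.0, 1.0]]  # used for the symbols 0 and 2
    W_odd = [[0.5, 0.5], [0.0, 1.0]]   # used for the symbols 1 and 3
    W = [W_even, W_odd, W_even, W_odd]
    for w in ("", "0", "1", "10"):
        print(repr(w), word_weight(I, W, F, w))
```

Evaluating the product left to right keeps only one vector of |Q| entries, so a
word of length k costs $O(k\,|Q|^{2})$ arithmetic operations.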




[Figure 6: a pseudo-regular block (A, B, C, D) of the image and the block X cut
off from it, together with block(B, v2), block(A, v1), block(D, v4) and
block(C, v3).]

Fig. 6. A regular block X cut off from a pseudo-regular block of T (as a matrix)
can be treated as the sum: X = block(A, v1) + block(B, v2) + block(C, v3) +
block(D, v4).



For a regular block Y of T and a vector v denote by block(Y, v) the square
which is identical to Y on the overlap of Y with the square of the same shape
shifted from the corner of Y inside Y by the vector v; all other entries are
blank (contain zeros), see Figure 6.

Theorem 4 (weighted automata).

Assume the compression is in terms of weighted automata. If |Compress(T)| = n
then for a square subsegment R of T we have $|Compress(R)| = O(n)$. The
Subsegment Compression Problem for weighted automata for a square subsegment
R can be solved in linear time.

Proof (sketch).

Let A be the weighted automaton defining T; we identify its states with regular
blocks. For each rank t a regular block of rank t of the subsegment is located
in a pseudo-regular block of T, where the vectors v1, v2, v3, v4 are as in
Figure 6.

We create the automaton A' for R. Its states are identified with block(Y, v),
where Y is a regular block of T and v is one of the vectors v1, ..., v4 which
depend on the rank. Correctness is based on the fact that the subblock X can
be treated as a matrix and it is the sum of 4 matrices, as shown in the figure.
For each rank there are 4 different vectors, so the number of states of A' is
$O(n)$. The construction goes top-down similarly as in Theorem 1, using the
ideas from the proof of Theorem 4.2 in the cited reference, where atomic
dependencies are decomposed into smaller ones. In our case atomic dependencies
correspond to subblocks of the type block(Y, v). We omit the details.

5 Two Applications of the Subsegment Compression 

We sketch two simple consequences of the subsegment compression. The first is
an improvement (and simplification) upon a similar earlier result and the
second one gives the first polynomial time algorithm for the compressed
checking problem for weighted automata.

Theorem 5. There is an algorithm for compressed pattern-matching for
deterministic automata working in time $O(n^{2.5} m)$, where n is the size of
the compressed representation of T and m is the total size of an uncompressed
pattern P.

Proof. In the process of constructing the representation for a subsegment we
compute pseudo-regular blocks. It can be shown that P occurs in T if it occurs
in a pseudo-regular block of rank t + 1, where t is the rank of P. Hence we can
construct all pseudo-regular blocks of rank t + 1 and check for them (by known
linear time algorithms for uncompressed two-dimensional matching) if P occurs
in one of them.

Theorem 6. There is a polynomial time algorithm for the fully compressed
checking problem for weighted automata.

Proof.

We use the following known result.

Claim. The equivalence of two weighted automata can be checked in polynomial
time.

We can construct the compressed representation of the subsegment R of T which
is of the same shape as the pattern P and starts at the same location. Then we
check equality of the images R, P in polynomial time due to the claim.
Abstract. We present a very efficient, in terms of space and access speed, data
structure for storing huge natural language data sets. The structure is
described as an LZ (Ziv Lempel) compressed linked list trie and is a step
beyond the directed acyclic word graph in automata compression. We use the
structure to store DELAF, a huge French lexicon with syntactical, grammatical
and lexical information associated with each word. The compressed structure
can be produced in O(N) time using suffix trees for finding repetitions in the
trie, but for large data sets space requirements are more prohibitive than
time, so suffix arrays are used instead, with compression time complexity
O(N log N) for all but the largest data sets.



1 Introduction 

Natural language processing has existed as a field since the origin of computer
science. However, interest in natural language processing has increased
recently due to the present extension of Internet communication, and to the
fact that nearly all texts produced today are stored on, or transmitted
through, a computer medium at least once during their lifetime. In this
context, the processing of large, unrestricted texts written in various
languages usually requires basic knowledge about the words of these languages.
These basic data are stored in large data sets called lexicons or electronic
dictionaries, in such a form that they can be exploited by computer
applications like spelling checkers, spelling advisers, typesetters, indexers,
compressors, speech synthesizers and others. The use of large-coverage
lexicons for natural language processing has decisive advantages. Precision
and accuracy: the lexicon contains all the words that were explicitly included
and only them, which is not the case with recognizers like spell [5].
Predictability: the behavior of a lexicon-based application can be deduced
from the explicit list of words in the lexicon. In this context, the storage
and lookup of large-coverage dictionaries can be costly. Therefore, time and
space efficiency is a crucial issue.



M. Crochemore, M. Paterson (Eds.): CPM'99, LNCS 1645, pp. 196-211, 1999. 
© Springer-Verlag Berlin Heidelberg 1999 




Ziv Lempel Compression of Huge Natural Language Data Tries Using Suffix Arrays 



197 



The trie data structure is a natural choice when it comes to storing and
searching over sets of strings or words. In the contemporary usage of the term,
a trie for a set of words is a tree in which each transition represents one
symbol (or a letter in a word), and nodes represent a word or a part of a word
that is spelled by traversal from the root to the given node. The identical
prefixes of different words are therefore represented with the same node, and
space is saved where identical prefixes abound in a set of words, a situation
likely to occur with natural language data. The access speed is high: a
successful lookup is performed in time proportional to the length of the word,
since it takes only as many comparisons as there are symbols in the word. An
unsuccessful search is stopped as soon as there is no letter in the trie that
continues the word at a given point, so it is even faster.
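
As an illustration of the lookup behaviour just described, here is a minimal
uncompressed trie. It is only a sketch with hypothetical names: it stores the
children of a node in a dictionary, which is not the linked list representation
compressed in this paper, and the sample words are arbitrary.

```python
class Trie:
    """A plain trie: one node per distinct prefix of the stored words."""

    def __init__(self):
        self.children = {}    # symbol -> child node
        self.is_word = False  # True if the path from the root spells a word

    def insert(self, word):
        node = self
        for symbol in word:
            node = node.children.setdefault(symbol, Trie())
        node.is_word = True

    def contains(self, word):
        node = self
        for symbol in word:
            node = node.children.get(symbol)
            if node is None:
                return False  # stop at the first symbol with no transition
        return node.is_word

    def node_count(self):
        return 1 + sum(child.node_count() for child in self.children.values())


if __name__ == "__main__":
    trie = Trie()
    for w in ("chat", "chats", "chaton"):
        trie.insert(w)
    print(trie.contains("chat"), trie.contains("cha"), trie.contains("chien"))
    print(trie.node_count())
```

The three sample words share the prefix "chat", so the trie needs far fewer
nodes than the total number of letters stored.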

When sets of strings are huge, a simple trie can grow to such proportions that its 
size becomes a restrictive factor in applications. A data structure that cannot fit 
into main memory means slower searching on disk; conversely, if the structure is 
small enough to fit into cache memory, the search speed increases. Numerous 
researchers have worked on compacting tries to reduce their size and increase the 
search speed. As there are many possible uses of a trie, most of the compaction 
methods are optimized according to specific application requirements. When data 
must be handled dynamically (databases, compilers), the trie has to support insertion 
and deletion operations as well as simple lookup; the best results in trie compaction, 
however, are achieved with static data. A few examples of work on dynamic trie 
compaction are [3], [7], [8], [15]. Static tries are used successfully in a number of 
important applications (natural language processing, network routing, data mining) 
and the efforts in static trie compression are both numerous and justified. Although 
researchers usually try to establish as good a trade-off between speed and size as 
possible, most of the work puts the emphasis on one of the two. Two examples of work 
where speed is the main concern are [2], where search speed is increased by 
reducing the number of levels in a binary trie, and [1], where trie data structures are 
constructed in such a manner that they accord well with computer memory architecture. 
When the size of the structure is the primary concern, the work is usually focused on 
automata compression. With natural language data, significant savings in memory 
space can be obtained if the dictionary is stored in a directed acyclic word graph 
(DAWG), a form of minimal deterministic automaton in which common suffixes are 
shared [4], [12], [13], [17]. 

The majority of European languages belong to a family of languages where (i) most of 
the words belong to a set of several morphologically close words (inflectional 
languages), and (ii) the difference between two such morphologically close words is 
usually a suffix substitution (suffixal inflection). That accounts for the good results of 
automata minimization: on average a substantial portion of a word overlaps 
with other words’ prefixes and suffixes. However, this works well only for simple 
word lists used mainly in spelling checkers. For most other applications (dictionaries, 
lexicons, translators) some additional data (lexical tags, index pointers) has to be 
attached to the word, which sharply reduces the overlapping of the suffixes. The additional 
data can be efficiently incorporated in the trie by a more complex implementation [16] 
or by using hashing transducers. The hashing transducer of a finite set of words 
was discovered and described independently in [13] and [17]. This scheme 
implements a one-to-one correspondence between the set of N words and the set of 
integers from 1 to N, the words being taken in alphabetical order. The user can obtain 
the number from the word and the word from the number in time linear in the length 







Strahil Ristov and Eric Laporte 



of the word, independently of the size of the lexicon, which amounts to perfect 
hashing. The transducer has the same states and the same transitions as the minimal 
automaton, but an integer is associated with each transition. The number of a word is the 
sum of the integers on the path that recognizes the word. Once the number of a word 
is known, a table is looked up in order to obtain the data associated with the word. 
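
The word-to-number correspondence can be sketched as follows. This is only an illustration under simplifying assumptions: it works on a plain trie rather than on the minimal automaton of [13] and [17], and instead of storing an integer on each transition it recomputes that integer on the fly from per-node word counts. All names are hypothetical.

```python
def new_node():
    return {"final": False, "count": 0, "next": {}}


def build(words):
    """Trie in which node['count'] is the number of words passing through the node."""
    root = new_node()
    for word in set(words):
        node = root
        node["count"] += 1
        for ch in word:
            node = node["next"].setdefault(ch, new_node())
            node["count"] += 1
        node["final"] = True
    return root


def word_to_number(root, word):
    """Alphabetical rank (1..N) of word, or None if the word is not stored."""
    node, number = root, 1
    for ch in word:
        if ch not in node["next"]:
            return None
        if node["final"]:
            number += 1                      # a shorter word ends here and precedes
        # the "integer on the transition": words reachable through smaller symbols
        number += sum(child["count"] for sym, child in node["next"].items() if sym < ch)
        node = node["next"][ch]
    return number if node["final"] else None


def number_to_word(root, number):
    """Inverse mapping: the word whose alphabetical rank is number."""
    if not 1 <= number <= root["count"]:
        return None
    node, remaining, out = root, number, []
    while True:
        if node["final"]:
            if remaining == 1:
                return "".join(out)
            remaining -= 1
        for sym in sorted(node["next"]):
            child = node["next"][sym]
            if remaining <= child["count"]:
                out.append(sym)
                node = child
                break
            remaining -= child["count"]
        else:
            raise ValueError("inconsistent counts")


words = ["wood", "woods", "word", "add", "adds"]
root = build(words)
```

The resulting number can then serve as an index into a separate table holding the data attached to each word.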

In this paper we investigate a new method of static trie compaction that reduces the 
size beyond that of the minimal finite automaton and allows incorporating the additional 
data in the trie itself. This involves coding the automaton so that not only common 
prefixes or suffixes are shared, but also the internal patterns. The procedure is best 
described as a generic Ziv Lempel compression of a linked list trie. The final compressed 
structure is formally more complex and has fewer states than the minimal finite automata 
used in [4] and [13]. A particularly attractive feature is the high repetition rate of 
structural units in the compressed structure, which enables space-efficient coding of the 
nodes. The idea has been informally introduced in [18] and [19]. Here we 
describe the method in more detail and demonstrate how it performs when used for 
storing DELAF, a huge lexicon of French words. We also present some compaction 
results for various natural language data sets. For the sets on which previous work has 
been reported in the literature our results are significantly better. 

In section 2 we present our method and introduce the notation we use throughout the 
article. Two essentially similar algorithms for compression are described in section 3; 
the first one is simpler and slower, the second one is much faster but requires more 
space. We also explain some heuristics for simplifying the algorithms and 
propose a related problem as an open problem in the theory of NP-completeness. In 
section 4 we describe the experimental data sets, among them a huge French lexicon, and 
present compression results. Conclusions are given in section 5. 



2 Overview of the Linked List Trie LZ Compression 

A trie T is a finite automaton and as such is defined by the quintuple 
$T = (Q, A, q_0, \delta, F)$, where $Q$ is a finite set of states, $A$ is an alphabet of input 
symbols, $q_0 \in Q$ is the initial state, $\delta$ is a transition function from $Q \times A$ to $Q$, and 
$F \subseteq Q$ is the set of accepting or final states. When a trie T is produced from a set of 
words W, then W is the language recognized by T. 

Natural language data usually produce very sparse tries that lend themselves to 
various possibilities for space reduction with retained high access speed. Sparseness 
of a tree is a strong indication for employing the linked list data structure in the 
representation of the nodes. When a linked list is used it is convenient to associate 
symbols of the alphabet with the levels rather than with the transitions in the trie. In this 
case levels are represented with lists of structural units where four pieces of 
information (Fig. 1a) are assigned to each unit: 

1. a symbol (letter) $a \in A$; 

2. a binary flag f indicating whether a word ends at this point (corresponding to a 
final state); 

3. a binary flag c indicating whether there is a continuation of a valid sequence of 
symbols past the current unit to the next level below; 

4. a pointer l to the next unit at the same level (if null, there are no more elements on 
the current level); if we use addressing in number of units, the size bound for l is 
the number of units in T. 

A linked list trie is then represented with a sequence or a string of units. Now, the 
units themselves can be regarded as symbols that make up a new alphabet U and the 
implemented trie structure can be defined as a string. 

DEFINITION: A linked list trie LLT is a string of symbols u from an alphabet U. If we 
denote by N the number of structural units in LLT then: 

$LLT = u_1 u_2 \ldots u_N$, $u_i \in U$, $N = |LLT|$, where 

$u_i = a_i f_i c_i l_i$, $a_i \in A$, $f_i \in \{0, 1\}$, $c_i \in \{0, 1\}$, $0 \le l_i < N$ 

To illustrate this, in Fig. 1b the units of the trie from Fig. 1a are replaced with a new set 
of symbols, yielding a string representation of LLT. Of course, when all of their 
parts are identical, two units are identical too and are consequently represented with the 
same symbol. 
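
The construction can be sketched in Python as follows. This is a minimal illustration, not the implementation used in the experiments. It assumes that units are laid out in depth-first order, so that the continuation flagged by c is always the next unit, and that the l pointer is stored as a relative offset with 0 standing for null. Both choices are our reading of Fig. 1, and all function names are hypothetical.

```python
def build_llt(words):
    """Return the linked list trie as a list of units (symbol, f, c, l)."""
    # Nested-dict trie; the key "" marks the end of a word.
    trie = {}
    for w in sorted(set(words)):
        node = trie
        for ch in w:
            node = node.setdefault(ch, {})
        node[""] = True

    units = []

    def emit(node):
        prev = None                              # index of the previous sibling unit
        for sym in sorted(k for k in node if k != ""):
            child = node[sym]
            idx = len(units)
            if prev is not None:
                s, f, c, _ = units[prev]
                units[prev] = (s, f, c, idx - prev)   # relative l pointer to this sibling
            f = 1 if "" in child else 0
            c = 1 if any(k != "" for k in child) else 0
            units.append((sym, f, c, 0))
            if c:
                emit(child)                      # the next level follows immediately
            prev = idx

    emit(trie)
    return units


def as_string(units):
    """Rename distinct units a, b, c, ... in order of first appearance."""
    names, out = {}, []
    for u in units:
        names.setdefault(u, chr(ord("a") + len(names)))
        out.append(names[u])
    return "".join(out)


def llt_lookup(units, word):
    """Follow c flags and l pointers; True if word is in the set."""
    if not units:
        return False
    pos = 0
    for i, ch in enumerate(word):
        while units[pos][0] != ch:
            if units[pos][3] == 0:
                return False                     # no further sibling on this level
            pos += units[pos][3]
        _, f, c, _ = units[pos]
        if i == len(word) - 1:
            return bool(f)
        if not c:
            return False
        pos += 1
    return False


llt = build_llt(["abaabaab", "abaabbab", "abbabaab", "abbabbab"])
```

For the four words of Fig. 1 this yields 20 units, and `as_string(llt)` reproduces the string of Fig. 1b.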

As on any string, some compression procedure can now be attempted on LLT. 
A particularly natural approach is to use the LZ paradigm of replacing repeated substrings 
with pointers to their first occurrences in the string [23]. The general condition for 
compression is that the size of a pointer must be less than the size of the replaced 
substring. We used units of constant and equal size for the representation of the elements 
of U and of the pointers, so that compression is achieved whenever a repeated substring is 
of size 2 or more elements. In Fig. 1c repeated substrings are replaced with 
information in parentheses about the position of the first occurrence of the repeated 
substring and its size. The first number designates the position in the (compressed) string 
and the second the length of the replaced substring. Note that the first occurrence of a 
substring can include a pointer to the previous first occurrence of a shorter substring. 

DEFINITION: Let $ls_i$ be the length of the i-th substituted substring in LLT and K be the 
number of substitutions. Then the reduction in space is $R = \sum_{i=1}^{K} (ls_i - 1)$. Let 
LLTC denote the compressed linked list trie such as that of Fig. 1c. The size $N_c$ of the 
compressed structure is then $N_c = |LLTC| = N - R$, and the compression ratio is 
$C = 1 - N_c/N$. 

All size values are given in number of structural units. For the example of Fig. 1c, 
R = 9 and C = 1 - 11/20 = 45%. 
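
The figures for the example can be checked with a short Python sketch. It expands the compressed sequence of Fig. 1c back into the string of Fig. 1b and recomputes R, $N_c$ and C. The token encoding (plain characters for units, tuples for pointers) and the function names are assumptions made for this illustration only.

```python
def expand(tokens, start, count):
    """Return `count` original units read from compressed position `start` (1-based)."""
    out = []
    pos = start
    while len(out) < count:
        tok = tokens[pos - 1]
        if isinstance(tok, tuple):           # embedded pointer: expand it recursively
            first, length = tok
            out.extend(expand(tokens, first, length))
        else:
            out.append(tok)
        pos += 1
    return out[:count]


def decompress(tokens):
    out = []
    for tok in tokens:
        if isinstance(tok, tuple):
            out.extend(expand(tokens, *tok))
        else:
            out.append(tok)
    return "".join(out)


# Fig. 1c: a b c (1,2) d a e b (6,2) b (4,8)
lltc = ["a", "b", "c", (1, 2), "d", "a", "e", "b", (6, 2), "b", (4, 8)]
llt = decompress(lltc)

lengths = [tok[1] for tok in lltc if isinstance(tok, tuple)]
R = sum(ls - 1 for ls in lengths)            # reduction in space
N = len(llt)
N_c = N - R                                  # size of the compressed structure
C = 1 - N_c / N                              # compression ratio
```

Note that the pointer (4,8) points at a position that itself holds the pointer (1,2), which is why `expand` is recursive.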

The sequence in Fig. 1c is a simplified representation of a compressed trie structure; 
lookup of the input is not performed sequentially, as Figs. 1b and 1c may seem to 
suggest, but still by following trie links. The only difference is that when, in reading the 
structure, a pointer unit $(P_0, ls_1)$ is encountered at position $P_1$, the reading procedure 
jumps to position $P_0$ and, after $ls_1$ units have been read, jumps back to position $P_1 + 1$. 

b) abcabdaebaebabdaebae 

c) a b c (1,2) d a e b (6,2) b (4,8) 




Fig. 1. a) A trie of four words {abaabaab, abaabbab, abbabaab, abbabbab} is presented in a 
graphical arrangement that points out its sequential features. Final states are indicated by thick 
circles; horizontal arrows represent c flags; inflected arrows represent l pointers. The structure is 
traversed by following the arrows and comparing the current input symbol with the one in the trie; 
if the symbols don’t match and there is no l pointer from the current unit then the input is rejected. The 
input sequence is accepted if it leads to a final state. b) LLT represented with a new set of 
symbols; identical units are replaced with the same symbol. c) Compressed representation of the 
LLT string. The first number in parentheses is the position of the first occurrence of the 
repeated/replaced substring, the second number is the substring’s length. d) Implementation of the 
compressed structure includes two types of pointers; S signs indicate pointers that replace 
whole branches and the K sign stands for a pointer that replaces only a portion of a branch and 
carries the information about its length (2 in this case). Inflected arrows below indicate the 
paths the reading procedure must follow in the structure. Full lines indicate one-way directions, 
dashed lines indicate directions implied by the K pointer. 



The actual implementation of LLT compression is more complex than a 
straightforward application of an LZ procedure on a string as in Fig. 1c, where there is no 
difference in the treatment of repeated substrings. The underlying structure of LLT is that 
of a tree, and this divides repeated substrings of LLT into two categories depending on 
whether the repeated substring represents a complete branch of the tree or just a portion 
of a branch. Only in the latter case should the pointers carry the information about 
the number of replaced units; when a whole branch is replaced, every possible 
continuation of the current input is contained in the first occurrence of the substring 
and there is no need to come back to the original position of the pointer. The second and 
third pointers of Fig. 1c replace whole branches of the trie and the first one substitutes 
only a part of a branch. The LLTC sequence with two types of pointers then might 
look like this: abc(1,2)daeb(6,_)b(4,_), where _ indicates that there is no possible 
need to come back. Fig. 1d shows how the structure of Fig. 1a is actually 
compressed with two different types of pointers. 

DEFINITION: Let us call one-way pointers those pointers that replace whole branches and 
two-way pointers those that replace only parts of branches. Let us say that a substring 
$s = u_i \ldots u_k$ of LLT is closed if no unit $u_j \in s$ contains an $l_j \in u_j$ pointer that points 
outside s, and there is no continuation to the next level from the last unit of s. That is: 

s is closed if $\forall\, l_j \in u_j \in s$: value($l_j$) $\le$ position($u_k$) and $c_k = 0$. 

Otherwise let us call s open. One-way pointers replace closed repeated substrings, 
two-way pointers replace open repeated substrings. 
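
The closed/open test can be written down directly. The sketch below is illustrative only: the unit table is our reconstruction of the five distinct units behind the letters of Fig. 1b, with l taken as a relative offset (0 for null), which is an assumption rather than something stated in the text.

```python
# Hypothetical reconstruction of the units named a..e in Fig. 1b: (symbol, f, c, l)
UNITS = {
    "a": ("a", 0, 1, 0),
    "b": ("b", 0, 1, 0),
    "c": ("a", 0, 1, 9),
    "d": ("a", 0, 1, 3),
    "e": ("b", 1, 0, 0),
}
llt = [UNITS[ch] for ch in "abcabdaebaebabdaebae"]


def is_closed(llt, start, length):
    """True if the substring of `length` units at 1-based position `start` is closed."""
    end = start + length - 1
    for pos in range(start, end + 1):
        _, _, _, l = llt[pos - 1]
        if l and pos + l > end:          # an l pointer leaves the substring
            return False
    return llt[end - 1][2] == 0          # no continuation from the last unit
```

On this example the substrings replaced by the second and third pointers of Fig. 1c (positions 10 to 11 and 13 to 20 of the uncompressed string) come out closed, while the one replaced by the first pointer (positions 4 to 5) is open, in line with the text.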

THEOREM: Replacing every closed repeated substring of LLT with a one-way 
pointer produces the DAWG for a given set of words W. 

Proof: The DAWG for a set of words W is the minimal finite automaton recognizing all 
the words in W. Minimization is obtained by merging all the equivalent states of the 
automaton. 

If two states are reached by sequences $s_1$ and $s_2$, they are equivalent if for every 
sequence z it holds that if $s_1 z$ is in W then $s_2 z$ is also, and if $s_1 z$ is not in W then neither is $s_2 z$. 
Since substrings of LLT replaced with one-way pointers are identical, it is obvious that 
they carry identical partial transition functions, and since they are closed there exist no 
other unknown suffixes, so the repeated states are indeed equivalent. 

The additional compression, above that of automata minimization, is achieved with the 
introduction of two-way pointers capable of replacing open substrings of LLT. It is 
worth noting that the formal complexity of the compressed structure is then higher than 
that of a finite automaton. States replaced by two-way pointers are not equivalent in the 
finite automata sense and some conditional branching is introduced in the procedure 
of reading the structure. For example, after reading b in the second position in Fig. 1d, 
the further direction depends on whether this is the first read or a read directed by 
the pointer at the fourth position. This type of decision is beyond the power of finite 
automata. 



3 Algorithms 

We first present a simple quadratic algorithm for producing LLT from W and then 
replacing repetitions with pointers. Denote by $s_i \in LLT$ a substring of units starting 
at position i in LLT. Let E be a relation of substring prefix equality on LLT such 
that $s_i E s_j$ means that at least the first two units of $s_i$ and $s_j$ are equal. That is: 
$s_i E s_j \Rightarrow u_i \ldots u_{i+k-1} = u_j \ldots u_{j+k-1}$ and $k \ge 2$. Let R be the relation of 
substring substitutability, where $s_i R s_j$ means that $s_j$ can be replaced with a pointer 
to $s_i$. For reasons of algorithmic complexity R covers a smaller class of LLT substrings 
than E; this will be explained a bit later. The algorithm is then as follows: 

ALGORITHM A: 

sort W 
build LLT(W) 
for i = 1 to N - 3 
    for j = i+2 to N - 1 
        if s_i E s_j 
            if s_i R s_j 
                check whether substitutable substrings are open or closed 
                replace s_j with the appropriate pointer 
end 
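
To make the quadratic search concrete, here is a deliberately simplified Python sketch. It is not Algorithm A itself: it scans target positions from left to right and greedily takes the longest earlier, non-overlapping match of at least two units (an LZ77-style variant swapped in for brevity), and it ignores the open/closed distinction as well as the restrictions that separate R from E. The function name is hypothetical.

```python
def find_substitutions(llt, min_len=2):
    """Return (target, source, length) triples in 1-based positions of the
    uncompressed string; each target substring would be replaced by a pointer."""
    n = len(llt)
    subs = []
    j = 0
    while j < n:
        best_i, best_len = 0, 0
        for i in range(j - min_len + 1):             # every earlier start with i + 2 <= j
            k = 0
            # extend the match while it stays inside the string and does not overlap j
            while j + k < n and i + k < j and llt[i + k] == llt[j + k]:
                k += 1
            if k > best_len:
                best_i, best_len = i, k
        if best_len >= min_len:
            subs.append((j + 1, best_i + 1, best_len))
            j += best_len                            # skip the replaced units
        else:
            j += 1
    return subs


subs = find_substitutions("abcabdaebaebabdaebae")    # the string of Fig. 1b
```

On the string of Fig. 1b this sketch finds three substitutions, but its second one is three units long rather than two as in Fig. 1c, so the result is 10 units instead of 11. The sketch applies none of the tree-related restrictions on substitutable substrings, so its output should not be read as a valid LLTC.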

Building LLT(W) is a straightforward procedure of building a trie that can be done 
in O(N) time and will not be explained here. Initial sorting of W is the simplest way 
of preventing the following situation: Let M be the number of words in W, let $w_m$, 
$m \le M$, denote the m-th word in W, and let $LLT_m$ be a linked list trie with m words built 
into it. If $w_{m+1}$ has a prefix $w_k$ that is also a word of W such that 
$k < m$ and $w_m$ is not a prefix of $w_{m+1}$, then there is no place for the suffix of $w_{m+1}$ that 
is the difference between $w_{m+1}$ and $w_k$. This suffix should find its place right after $w_k$, 
but since there is already at least one word that is not a prefix of $w_{m+1}$ in the structure, this 
place is occupied. A situation like this would require the use of additional pointers in the 
construction of the trie, and it is more economical to arrange the input order in 
a way that avoids this. The simplest way to do so is to sort W before building LLT(W); 
then if words exist in W that are prefixes of other words they are all grouped together 
and any existing prefix of $w_{m+1}$ is at the end of $LLT_m$. 

The central part of the presented algorithm has clear quadratic time complexity. 
The double loop of comparing each position in LLT with every other to check whether 
they are the starting positions of equal substrings takes $N^2/2$ iterations. (The inner loop 
is only shifted to the right by two, the minimum size of substitutable substrings.) The 
procedures of checking whether repeated substrings are open or closed and replacing 
them with pointers are done only once for each replaced substring, so they add to the 
overall complexity only a linear factor proportional to R. The average input sorting 
procedure is done in $O(M \log M)$ time and the total time complexity for producing 
LLTC from W is then $O(M \log M + N + N^2 + R)$, with $O(N^2)$ being by far the most 
important bound. In practice this simple procedure is fast enough for smaller data sets, 
such as smaller simple word lists with a high prefix repetition rate that produce smaller 
tries. Unfortunately, for bigger sets of entries that do not share too many common 
prefixes, and therefore produce huge tries, the exhaustive quadratic procedure is not 
feasible. 



3.1 Speed Up via Suffix Matching 

A speed-up is possible, and in fact a linear time bound can be achieved by using a suffix tree 
for finding repetitions in LLT. The idea of assisting LZ compression with suffix tree 
search was first presented in [21]. A suffix tree of all suffixes in LLT can be 
built in O(N) time; all the repetitions in LLT are then associated with the nodes in the 
suffix tree and easily found in linear time [11]. The problem with building suffix trees 
is that they require too much space when the alphabet is large, as is the alphabet of all 
different units of LLT, and for this case a better approach is to use suffix arrays [14]. 
A suffix array for LLT is an array of starting positions in LLT of the sorted suffixes of 
LLT. Sorting is on average done in $O(N \log N)$ time, and then all repeated 
substrings are grouped together in the suffix array. 

The remaining problem is to find the best candidates for replacement with pointers 
among the substrings grouped together. The simplest way to do this is to delimit 
groups of suffixes in the suffix array that have at least the first two elements identical and 
then to perform the quadratic search only on the elements in each group. These groups should 
be sorted according to the suffix starting position in LLT so that the search and replace 
procedure can be done in consecutive order from the beginning of the structure. This 
is important because it avoids the considerable expense of keeping track of all the 
changes in the structure that can interfere with incoming replacements. Overall, this is 
a much faster way to find possible candidates for substitution with pointers than the 
exhaustive quadratic search of Algorithm A. The procedure is then: 

ALGORITHM B: 

1: sort W 
2: build LLT(W) 
3: build suffix_array(LLT(W)) 
4: define partitions of suffix_array(LLT(W)) that comprise two or more entries 
   with identical first two units 
5: sort the suffixes in partitions according to their position in LLT(W) 
6: from the first to the last element in partitions compare each element to every 
   other from the same group | check whether substitutable substrings are open 
   or closed | replace substitutable substrings with the appropriate pointers 
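
Steps 3 to 5 can be sketched in a few lines of Python. The suffix array below is built naively by sorting the suffixes themselves, which is far slower than a proper construction and is used here only to keep the illustration short. Positions are 0-based and the function names are hypothetical.

```python
from collections import defaultdict


def suffix_array(llt):
    """Step 3, naive version: sort suffix start positions by the suffixes themselves."""
    return sorted(range(len(llt)), key=lambda i: llt[i:])


def partitions(llt, sa):
    """Steps 4 and 5: group suffixes by their first two units, keep groups of
    two or more entries, and order each group by position in LLT."""
    groups = defaultdict(list)
    for pos in sa:                               # equal two-unit prefixes are adjacent in sa
        if pos + 2 <= len(llt):                  # the suffix has at least two units
            groups[tuple(llt[pos:pos + 2])].append(pos)
    return [sorted(g) for g in groups.values() if len(g) >= 2]


llt = "abcabdaebaebabdaebae"                     # the string of Fig. 1b
sa = suffix_array(llt)
parts = partitions(llt, sa)
```

For the string of Fig. 1b this gives six groups of sizes 3, 4, 3, 2, 2 and 3, so step 6 only has to compare elements within these small groups.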

The time complexity of comparing substrings at the suffixes’ starting positions to possible 
candidates for replacement within the groups is still quadratic but with a much smaller 
base. If there are G different groups of suffixes with identical beginnings in 
suffix_array(LLT(W)) and $SG_i$, $i = 1, \ldots, G$, is the number of elements in the i-th group, the 
time complexity of step 6 is $O(\sum SG_i^2)$. For real data the size of any group is much 
smaller than N, so this improves strongly on the time requirements of Algorithm A. The 
price is paid in space used for the suffix array and the tables needed for storing and searching 
groups. There is also the sorting of the groups in step 5, which requires 
$O(\sum SG_i \log SG_i)$ time, so the total time complexity of Algorithm B is 
$O(M \log M + N + N \log N + \sum SG_i \log SG_i + \sum SG_i^2 + R)$. In our experiments 
steps 3 to 5 consume most of the running time of Algorithm B for values of N up to a 
million. Only for tries with more units does the quadratic time complexity of step 6 become 
increasingly important. However, these are the values of N where the difference in 
complexity of Algorithms A and B matters the most. For the biggest tries that we 
experimented with (N = 16 million) the estimated run time of Algorithm A is 250 times 
longer. 
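
The difference between the two bounds can be made tangible with a small calculation. The sketch below simply evaluates the complexity terms quoted above as rough operation counts; it is an illustration, not a measurement. The group sizes used in the example are those we counted by hand for the two-unit groups of the Fig. 1b string, which is an assumption of this example.

```python
from math import log2


def step_costs(n_units, group_sizes):
    """Rough operation counts taken from the complexity terms in the text."""
    exhaustive = n_units ** 2 / 2                          # double loop of Algorithm A
    group_sort = sum(s * log2(s) for s in group_sizes)     # step 5 of Algorithm B
    group_search = sum(s * s for s in group_sizes)         # step 6 of Algorithm B
    return exhaustive, group_sort, group_search


exhaustive, group_sort, group_search = step_costs(20, [3, 4, 3, 2, 2, 3])
```

Even on this toy input the group-wise search term (51) is about a quarter of the exhaustive one (200); the gap widens as N grows while the groups stay small.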

If some additional structures are used to mark already replaced substrings then the 
$\sum SG_i^2$ factor can be improved to $\sum RG_i^2$, where $RG_i$ is the number of substitutions 
actually performed in the i-th group. This has not been verified experimentally, since 
Algorithm B already uses considerably more space than Algorithm A and for large N 
values the size of the additional structures may become a restricting factor. 



3.2 Bounds for Compression of LLTC 

LLTC produced by Algorithms A or B is not necessarily the smallest possible 
structure of this sort recognizing W. There exists one obvious structural limitation for 
compression, a constant size of unit, and some algorithmic limitations that are 
imposed for the sake of algorithmic simplicity. 

Size of Structural Units. If the size of the structural unit is kept constant, which 
immensely simplifies and speeds up the lookup procedure, then the bound for the size 
of each unit is the size of the units holding the largest numerical information. There are 
two types of structural units in LLTC: the symbol units, the same as those of LLT, that 
carry the symbol code a, the f and c flags and the l pointer, and the pointer units that are 
either one- or two-way pointers replacing repeated substrings in LLTC. The size limit 
for a symbol unit in bits is given by $\lceil \log |A| \rceil + 1 + 1 + \lceil \log N \rceil$, and this limit is forced 
onto pointer units too. Pointer units carry information about the address of the first 
occurrence of the substituted substring, about its length (if two-way), and some 
information that distinguishes them from symbol units. In symbol units either the f or the c 
flag or both must be 1 (true), because the word can only end with the current symbol 
or be continued to the next one. Therefore the combination of two zeros for the f and c flags 
is impossible in symbol units, and this is used as an indication that the current unit is a 
pointer. The bound for the size of the address of the first occurrence of the replaced 
substring is $\lceil \log N \rceil$ again, so this leaves $\lceil \log |A| \rceil$ bits in pointer units for storing the 
length of the replaced substring for two-way pointers. This was enough for every data set 
we have experimented with so far. LLTC normally supports embedded pointers, i.e. a 
pointer can point to a sequence of units that contains another pointer, and this can 
have many levels. For reasons of space economy we store in two-way pointers 
only the number of units that has to be followed on the first level, which is usually 
considerably smaller than the full length of the replaced substring. Apart from this 
little trick there is another reason why $\lceil \log |A| \rceil$ bits are enough for two-way pointer 
information: the longest substituted substrings are usually closed and are therefore 
replaced with one-way pointers. The problem with constant size units is that when 
$N_c$ is big, most of the l pointers are much smaller in value and a considerable amount 
of space is wasted. If this becomes critical it is always possible to use variable size 
coding of units or, which should be the best solution for the overall reduction of 
redundancy in LLTC, to use an additional table for minimal size coding of units, 
described later in section 3.3. 
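
The size bound and the use of the impossible flag combination f = c = 0 as a pointer marker can be illustrated with a small bit-packing sketch. The field order (symbol or length field, then f, then c, then the address) and the example figures of 32 symbols and one million units are hypothetical choices made for this illustration; logarithms are taken to base 2.

```python
from math import ceil, log2


def unit_bits(alphabet_size, n_units):
    """ceil(log |A|) + 1 + 1 + ceil(log N) bits for a constant-size unit."""
    return ceil(log2(alphabet_size)) + 1 + 1 + ceil(log2(n_units))


def pack(field, f, c, addr, addr_bits):
    """field is the symbol code (symbol unit) or the length (pointer unit)."""
    return (((field << 1 | f) << 1 | c) << addr_bits) | addr


def unpack(word, addr_bits):
    addr = word & ((1 << addr_bits) - 1)
    c = (word >> addr_bits) & 1
    f = (word >> (addr_bits + 1)) & 1
    field = word >> (addr_bits + 2)
    kind = "pointer" if (f, c) == (0, 0) else "symbol"   # f = c = 0 marks a pointer unit
    return kind, field, f, c, addr


ADDR_BITS = ceil(log2(1_000_000))                        # 20 bits for a million units
symbol_unit = pack(3, 1, 0, 12345, ADDR_BITS)
pointer_unit = pack(2, 0, 0, 6, ADDR_BITS)
```

Under these assumptions a unit takes 27 bits, and `unpack` tells the two kinds of unit apart without any extra tag bit.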

Algorithmic Complexity Constraints on Possible Substring Substitution. There 
are three algorithmic limitations to compression of LLTC arising from its underlying 
tree structure and they are defined with the following rules: 

Rule 1. If the repeated substrings overlap, then shorten them so that they don’t. 

Rule 2. If $s_i = u_i \ldots u_{i+ls}$ is a repeated substring and $l_k \in u_k \in s_i$ has 
value($l_k$) $> i + ls + 1$, then shorten $s_i$ to $s_i' = u_i \ldots u_{k-1}$. 

Rule 3. If $s_i = u_i \ldots u_{i+ls}$ is a repeated substring and there exists $l_h \in u_h$, $h < i$, such 
that value($l_h$) $= k$, $i + 1 < k < i + ls$, then shorten $s_i$ to $s_i' = u_i \ldots u_{k-1}$. 

The above three rules account for the aforementioned difference between the classes of 
equal and substitutable substrings of LLT. If these rules were not observed, situations 
would occur that require complicated procedures to resolve while at the 
same time not improving much on the compression. If overlapping of replaced 
substrings were allowed it would take great pains to avoid never-ending loops, and the 
savings in space would be only one unit per occurrence. (If overlapping is allowed, 
pattern.pattern.pattern can be replaced with pattern.pointer, and if not, with 
pattern.pointer.pointer, at the cost of only one more pointer in space.) Hence 
Rule 1. 

Rule 2 prevents the substitution of a substring $s_i$ that contains an l pointer pointing 
out of $s_i$ by more than one. This is necessary because it is possible that the substring of 
LLT between the end of $s_i$ and the position the l pointer points to is later 
replaced with another pointer unit, and then the value of l would no longer be correct. 
To account for that a complicated and time-costly checking procedure would have to be 
employed, and the savings would be at most two units per occurrence. (If k = i + ls 
then only unit $u_k$ is not included in the substituted substring, if k = i + ls - 1 then the 
loss is two units, and if k < i + ls - 1 then the part of $s_i$ behind $u_k$ is a new 
repeated substring and can be replaced with a new pointer, so the loss is again only 
two units.) 

Rule 3 for similar reasons shortens $s_i$ up to the position pointed to by some l 
pointer positioned before $s_i$. If $s_i$ were replaced then this l value would no longer be correct 
and the necessary checking would be unjustifiably costly. Analogously to 
Rule 2, the loss in compression is at most two units per occurrence. 

It should be noted that situations where Rules 1, 2 and 3 come into effect occur 
seldom in the natural language data that we have been experimenting with so far. 
Apparently, application of these rules worsens the compression by not more than 3%. 

Input Ordering Problem. Apart from that, there exists a serious algorithmic 
impediment to the optimization of LLTC compression, introduced by the order of input 
words when building LLT. Fortunately, this has only theoretical importance and 
carries little weight in practice. Let us consider a special case where W can be 
divided into a set of distinct partitions $W_i$, $W_i \subset W$, such that every word in $W_i$ has the 
same length $L_i$ and differs from the other words in $W_i$ only in the last letter. Let $P_i$ denote 
the sequence of units in LLT that represents the common prefix of the words in $W_i$; then 
length($P_i$) = $L_i - 1$. Let $u_{ik}$ denote the unit representing the last letter in word $w_{ik} \in W_i$, 
where $k = 1, \ldots, K_i$, $K_i = |W_i|$. Suppose that no word in W is a prefix of another word in 
W; then when a linked list trie is built each subset $W_i$ produces an LLT branch of type 
$P_i u_{i1} u_{i2} \ldots u_{iK_i}$. Units corresponding to the last letters in words are connected with l 
pointers of value one and are identical in every aspect but for the symbol content 
throughout all the subsets $W_i$. The ordering of the sequence of units $u_{i1} \ldots u_{iK_i}$ has no 
bearing on the content of LLT; it is determined by the ordering of input words, which can be 
arbitrary since no word of W is a prefix of another word in W. Now, the problem is 
how to order the sequences of units in such a way as to obtain the highest possible 
compression achieved by replacing substitutable substrings in LLT with pointers. We 
have not been able to find an efficient solution for this problem and we suspect it is 
NP-hard. We have not been able to prove that either, so we propose this as an open 
problem in the theory of NP-completeness. Reduced for simplicity it can be stated as: 

INSTANCE: Finite set of variables V and a collection T of triples $(v_i, v_j, v_k)$ from V. 
Each triple stands for the statement 

$v_i \; Z \; v_j$ and $v_i \; Z \; v_k$ 

where Z stands for any transitive, asymmetrical and irreflexive relation such as 
‘smaller than’, ‘bigger than’, ‘has lower/higher rank’ etc. 

QUESTION: Is there an assignment of values to variables in V such that the number 
of statements (or triples) that are satisfied is not less than a given integer I < |T|? 

The order of input words may therefore influence how well the linked list trie 
is compressed. With actual natural language data this is not an important factor: the 
lexicographical sort of the input results in a highly repetitious LLT structure and this 
normally solves the problem well enough. When we investigated possible variations 
between worst and best case orderings on actual data, the difference in size of the 
compressed structures was never above 2%. 



3.3 Minimal Size Unit Coding with Table Lookup 

An interesting and exploitable feature of LLTC is a high repetition rate of identical 
units throughout the structure. Apparently, the lexicographic sort of input records 
combined with the employed linked list representation produces a high level of structural 
unit repetition in both LLT and LLTC. This effect gets more pronounced with larger 
data sets. For example, in a compressed trie of over 2 million elements only about 
200,000 units are different. A simple and very effective coding of the units is 
therefore possible for reducing redundancy in the structure. If all the different units 
are stored separately in a table of size ND × (unit size), where ND is the number of 
different units, then LLTC can be represented with an array of N pointers of size 
$\lceil \log ND \rceil$ bits. On top of this, up to two bits per table unit can be saved by using their 
position in the table instead of flags. In most cases table coding leads to important savings 
in space, and the time needed for table lookup only about halves the search speed, as 
indicated by our experiments. 
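
The table coding can be sketched as follows. The sketch stores each distinct unit once and replaces the structure by an array of table indices; it then compares the two sizes in bits. The unit size of 32 bits used in the example is a hypothetical figure, and the saving of up to two flag bits per table entry is not modelled. Function names are ours.

```python
from math import ceil, log2


def table_code(lltc):
    """Return (table of distinct units, array of indices into the table)."""
    table, index_of, codes = [], {}, []
    for unit in lltc:
        if unit not in index_of:
            index_of[unit] = len(table)
            table.append(unit)
        codes.append(index_of[unit])
    return table, codes


def sizes_in_bits(n_units, n_distinct, unit_bits):
    """Size without a table versus table plus index array."""
    plain = n_units * unit_bits
    coded = n_distinct * unit_bits + n_units * ceil(log2(n_distinct))
    return plain, coded


table, codes = table_code("abcabdaebaebabdaebae")
plain, coded = sizes_in_bits(2_000_000, 200_000, 32)
```

With 2 million units of which 200,000 are distinct, and the assumed 32-bit units, the index array needs 18 bits per entry and the total drops from 64,000,000 to 42,400,000 bits.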

The compressed structures produced with Algorithms A or B are very compact and 
fast to search. Typical access speed for LLTC is measured in tens of thousands of 
found words per second. This is fast enough for any real time application, even for 
those that rely on an exhaustive search in space of similar words. In the following 
section we describe some actual data sets and present results of compaction 
experiments. 



4 Data Sets and Experimental Results 



4.1 Natural Language Lexicon 

A simple spell-checker needs only to recognize whether a word belongs to the 
vocabulary of the language or not. In that case, the states of the automaton 
recognizing a word set are classified as final or non-final. For most other applications, 
correct words need to be assigned a lexical tag with a grammatical content: part of 
speech (noun, verb...), inflectional features (plural, 3rd person...), lemma (e.g. the 
infinitive for a verb). For instance, woods should be assigned a tag like wood.N:p (i.e. 
the noun wood in the plural). A minimal automaton can still represent a dictionary 
that assigns tags to words. Two methods are used to allow for tags in the dictionary. 
In the first [17], [20], tags are associated with states; the automaton has multiple 
finalities, i.e. the number of finalities is not necessarily 2 (final/non-final) but the 
number of tags. In the second method [12], tags are considered as parts of dictionary 
items. In both cases, minimization is still possible and time efficiency is preserved, 
but the minimization is less efficient in space, since common suffixes are no longer 
shared when the words have different tags (e.g. ds in the noun woods and in the verb 
adds). 

When the linguistic information in the tags is limited to basic grammatical 
information, the number of possible different tags can remain small and these 
solutions are still optimal. The limit is reached when more elaborate information is 
included into the tags, namely syntactic information (number of essential 
complements of verbs, prepositions used with them, distribution of subjects and 
complements). When this information is provided systematically, the number of 
different tags comes close to the number of words, and beyond because this level of 
description requires more sense distinctions [9]. Consequently, the minimal 
automaton grows nearly as large as the trie. However, the variety of labels used in 
tags is more limited and there exists a substantial amount of substring repetition in 
lexical entries. For this reason LLTC structure seems like a natural choice for storing 
lexicons. 

We used LLTC for compressing a comprehensive dictionary of French, the 
DELAF [6]. This dictionary lists 600,000 inflected forms of simple words. It is used 
by the INTEX system of lexical analysis of natural-language texts [22]. Linguistic 
information is attached to each form: parts of speech (noun, verb...); inflectional
features (gender, tense...); lemma (e.g. the infinitive in the case of a verbal form);
syntactic information about verbs. In case of ambiguities, appropriate sense
distinctions are made. The syntactic information attached to verbal forms is derived
from the lexicon-grammar of French, a systematic inventory of formal syntactic
constraints: number of essential complements, prepositions used with them,
distribution of subjects and complements etc. [10]. The size of DELAF in text format
is 21 Mbytes. A typical example of three entries in DELAF is presented in Fig. 2.

abandon,.N:ms

abandonna,abandonner.V-i-t-i-32CL-i-32H-i-36R-i-38LR-i-38L1-i-6-i-9:IPA3s/abandonner.V-i-{s'~}-i-i-i-7:IPA3s/abandonner.V-i-i-i-31H-i-35R:IPA3s

abandonnai,abandonner.V-i-t-i-32CL-i-32H-i-36R-i-38LR-i-38L1-i-6-i-9:IPA1s/abandonner.V-i-{s'~}-i-i-i-7:IPA1s/abandonner.V-i-i-i-31H-i-35R:IPA1s

Fig. 2. Three entries in the DELAF lexicon of French words, with attached grammatical,
syntactic and lexical data.

Three things are obvious from this example: first, the amount of repeated substrings is
high; second, a simple DAWG would be of little use since the endings of entries are
highly diversified (i.e. there are not many equivalent states in the finite automaton
produced from DELAF); and third, a trie produced from entries such as those in Fig. 2
will be huge. The first two facts speak in favor of trying to store DELAF in LLTC,
but the third presents a problem. A huge LLT means a huge N, and the quadratic part
of the compression algorithm becomes important. In fact, with Algorithm B the
compression time for LLT(DELAF) was 5.5 hours on a 333 MHz PC running Linux.
In Table 1 we present all the relevant numbers for experiments with DELAF and
other data sets.

The compressed size with table unit coding is 5.5 Mbytes. This is a considerable
improvement over the currently used format with tags stored separately, which is over
twice that size. Reduction in size can be important in integrated applications where the
lexicon is only a part of the system (computer-aided translation, natural language access
to databases, information retrieval). The five-and-a-half-hour compression time is
acceptable for this instance because it is unlikely that data sets of this type will be
updated on the fly. The search speed is high enough for every application we can foresee.



4.2 Other Data Sets 

In order to demonstrate the potential of our method for compressing static dictionaries,
we present in Table 1 experimental results for seven additional natural language data
sets. Six are publicly available and some compression results have already been
published for two of them. Brief descriptions follow:

- DELAF word forms: all the simple French word forms without any additional
information, extracted from DELAF

- Calgary book1 7-tuples: a list of all successive seven-tuples from book1 of the Calgary
corpus; the compressed size of this set as reported in [7] is about 2.5 M

- words: a list of English words found in /usr/dict/words on Unix systems (older
release); the compressed size of this set as reported in [13] is 112 K

- linux.words: a list of English words found in /usr/dict/linux.words on Linux systems

- Moby words simple: a list of simple English words from
http://www.dcs.shef.ac.uk/research/ilash/Moby/mwords.html

- Moby words compound: a list of compound English words from
http://www.dcs.shef.ac.uk/research/ilash/Moby/mwords.html

- Moby words all: the combined simple and compound word lists above.

The compression times and search speeds were measured on a 333 MHz Pentium II PC
under Linux. The compression times given are for Algorithm B steps 3 to 6, i.e.
without the initial sorting of input entries and building of the trie. Search speed is
calculated by measuring the time needed for reading all the input words from disk and
looking them up in the compressed structure loaded in main memory. The first, most
densely populated, level of the compressed trie is accessed through an array of
starting positions for each letter instead of searching the list. This speeds up the search
by up to 20% with a space overhead of only 512 bytes for the array (if long integers
are used as pointers to starting positions of different letters in LLTC).

In the standard coding of LLTC units, node sizes are rounded to a whole byte for
optimum speed and simplicity. In some cases this is a considerable waste; for
instance, the largest pointer units for the Moby data require 26 bits, leaving 6 bits per
4-byte unit unused. In structures with minimal coding all elements are coded with the
minimum number of bits. Only a small overhead of a few bytes is necessary for denoting
table and array element sizes, as well as the distribution of various pointers in the table.




Table 1. Experimental results for various natural language data sets 

5 Conclusion

Experimental results presented in Table 1 show that our method exhibits considerable 
potential for storing natural language data, for inflected languages more than for non- 
inflected - the French word forms set compresses considerably better than the sets of 
English words. Still, it performs well for every set tested. The only data sets we could 
find with previously published results (words and 7-tuples) compress better than 
previously reported. One would expect that an increased number of words would always
lead to a better overlapping of substrings. It is therefore somewhat surprising that
the combined sets of Moby simple and compound words do not compress better than
when separated. Also, although we are satisfied with the final result, the huge number
of different tags in DELAF did not compress as well as we expected. When partitions
of DELAF (even as small as 10,000 entries) are compressed separately, the
compression ratio is roughly the same as for the whole set. Obviously, with LLTC
compression, as with any compression method, the degree of success depends on the
actual data. Overall, we believe that the presented method of LZ linked list trie
compression can be successfully used for storing and accessing data in various natural
language related applications.

Abstract. In this paper, we study pattern matching of points under
non-uniform distortions. First we give a natural definition for the prob- 
lem. Next we present a simple polynomial time algorithm for the one- 
dimensional case of the problem, whereas we prove that it is NP-hard 
in two (or more) dimensions. Then we present a practical heuristic al- 
gorithm for finding a matching between two sets of spots obtained by 
the two-dimensional gel electrophoresis technique, which is a special but 
important case of the problem. 



1 Introduction 

Matching of spatial point sets (i.e., comparing two sets of points) is an important
pattern matching problem, and thus many studies have been done in computational
geometry and pattern recognition.

In most studies in computational geometry, only uniform transformations
(e.g., translations, rigid motions and/or scalings) were considered. However,
in some applications, non-uniform distortions may occur and thus pattern matching
based on local similarity is important. Pattern matching of spots obtained
by the two-dimensional gel electrophoresis technique is an important example of
such applications. We are also developing a system named DDGEL
for the analysis of two-dimensional gel electrophoresis images obtained from genomic
DNA by means of the RLGS (Restriction Landmark Genomic Scanning)
method [5]. In this application, positions of spots are distorted non-uniformly
and thus the methods developed in computational geometry are not directly
applicable.

On the other hand, in pattern recognition (and in image analysis of electrophoresis
data), many studies have been done on pattern matching under
non-uniform distortions, although most of them are heuristic.



M. Crochemore, M. Paterson (Eds.); CPM’99, LNCS 1645, pp. 212 -^ 7 ^ 1999. 
(c) Springer- Verlag Berlin Heidelberg 1999 
Appel et al. considered transformations based on second-order and third-order
polynomials in order to cope with non-uniform distortions appearing in electrophoresis
image data [2]. However, their method (and many other methods
for electrophoresis image analysis) uses so-called landmarks in order to find the
polynomials, where landmarks are spot pairs intensively marked in both images by
the user and selected as putative matching pairs.

Several groups applied Delaunay graphs (Delaunay triangulations) and/or
relative neighborhood graphs for point matching under non-uniform distortions.
In most of such studies, a Delaunay graph (or a relative neighborhood
graph) is first computed from each set of points, and then a maximum common
subgraph (or a similar structure) between the two graphs is computed. However,
finding a maximum common subgraph is time consuming (it is NP-hard in general)
and thus various heuristics are employed in Delaunay based approaches. It
is a natural question whether or not such a time consuming procedure is essential
for point matching under non-uniform distortions. This question is the theoretical
motivation of this study.

This paper consists of two parts: a theoretical part and a practical part.

In the theoretical part, we give a natural definition for point matching under
non-uniform distortions, where similar formalizations have been given elsewhere. We
present a simple polynomial time algorithm for the one-dimensional case of the
problem, which is similar to well-known DP (dynamic programming) algorithms
for approximate string matching and sequence alignment. On the other hand,
we prove that the problem is NP-hard in two or more dimensions. This result
answers the above question: time consuming search procedures such as finding a
maximum common subgraph are essential for point matching under non-uniform
distortions unless P = NP.

In the practical part, we show a heuristic method for spot matching for two- 
dimensional electrophoresis gel image data. Although this method is heuristic, 
it uses an algorithm which is an extension of the DP algorithm for 1-D (one- 
dimensional) case. The method is implemented in the DDGEL system and is 
being tested using real gel image data. 



2 Point Matching Under Non-uniform Distortion 

2.1 Definition of the Problem 

Let $P = \{p_1, \ldots, p_m\}$ and $Q = \{q_1, \ldots, q_n\}$ be point sets in $d$
dimensions. We call a set of pairs $M = \{(p_{i_1}, q_{j_1}), \ldots, (p_{i_l}, q_{j_l})\}$
a matching if $(\forall h \neq k)(p_{i_h} \neq p_{i_k} \text{ and } q_{j_h} \neq q_{j_k})$.

Definition 1. (Point Matching Under Non-uniform Distortion)

Point matching under non-uniform distortion is, given a positive real $\epsilon$ and two
point sets $P = \{p_1, \ldots, p_m\}$ and $Q = \{q_1, \ldots, q_n\}$ in $d$-dimensional
Euclidean space, the problem of finding a maximum matching
$M = \{(p_{i_1}, q_{j_1}), \ldots, (p_{i_l}, q_{j_l})\}$ (i.e., a matching $M$
with the maximum cardinality) satisfying

\[ (\forall h \neq k) \left( \frac{1}{1+\epsilon} < \frac{|q_{j_h} - q_{j_k}|}{|p_{i_h} - p_{i_k}|} < 1 + \epsilon \right), \]

where $|p - q|$ denotes the Euclidean distance between $p$ and $q$.

Note that $P$ and $Q$ can be exchanged in the above definition because
$\frac{1}{1+\epsilon} < \frac{a}{b} < 1 + \epsilon$ if and only if
$\frac{1}{1+\epsilon} < \frac{b}{a} < 1 + \epsilon$. Note also that, in this definition, local
similarity must be preserved because the error for two point pairs must be small if
the distances between the points are small.
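
The condition in Definition 1 can be checked directly for a candidate matching. The following Python sketch is only an illustration under stated assumptions: the function name `is_valid_matching` and the representation of points as coordinate tuples are hypothetical and do not come from the paper.

```python
import math
from itertools import combinations

def is_valid_matching(pairs, eps):
    """Check the distortion condition of Definition 1 for a list of (p, q) pairs.

    Points are tuples of coordinates (an assumption of this sketch).
    """
    ps = [p for p, _ in pairs]
    qs = [q for _, q in pairs]
    # A matching must not reuse a point of P or of Q.
    if len(set(ps)) != len(ps) or len(set(qs)) != len(qs):
        return False
    lower, upper = 1.0 / (1.0 + eps), 1.0 + eps
    for (p1, q1), (p2, q2) in combinations(pairs, 2):
        # Ratio of Euclidean distances; p1 != p2 is guaranteed above.
        ratio = math.dist(q1, q2) / math.dist(p1, p2)
        if not (lower < ratio < upper):
            return False
    return True
```

The check takes quadratic time in the number of pairs, since every pair of matched points is tested, exactly as the quantifier $(\forall h \neq k)$ requires.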



2.2 A Simple DP Algorithm for 1-D Case 

For the 1-D case, point matching under non-uniform distortion can be solved by the
following simple dynamic programming algorithm, where only the procedure for
computing the scores of point pairs is shown. In the following, we assume that the
points are already sorted in ascending order (i.e., $p_1 < p_2 < \cdots < p_m$ and
$q_1 < q_2 < \cdots < q_n$).

    for i = 1 to m do D[i, 1] ← 1;
    for j = 1 to n do D[1, j] ← 1;
    for i = 2 to m do
        for j = 2 to n do
        begin
            maxD ← 0;
            for k = 1 to i − 1 do
                for h = 1 to j − 1 do
                    if 1/(1 + ε) < |q_j − q_h| / |p_i − p_k| < 1 + ε and D[k, h] > maxD
                    then maxD ← D[k, h];
            D[i, j] ← maxD + 1;
        end;

It is obvious that the algorithm works in $O(m^2 n^2)$ time. The correctness of
the algorithm follows from the proposition below.



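Proposition 1. If both

\[ \frac{1}{1+\epsilon} < \frac{|q_{j_2} - q_{j_1}|}{|p_{i_2} - p_{i_1}|} < 1 + \epsilon \quad \text{and} \quad \frac{1}{1+\epsilon} < \frac{|q_{j_3} - q_{j_2}|}{|p_{i_3} - p_{i_2}|} < 1 + \epsilon \]

hold where $i_1 < i_2 < i_3$ and $j_1 < j_2 < j_3$, then

\[ \frac{1}{1+\epsilon} < \frac{|q_{j_3} - q_{j_1}|}{|p_{i_3} - p_{i_1}|} < 1 + \epsilon \]

holds.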
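
The score computation above can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, indices are 0-based, the points are assumed to be pairwise distinct within each set, and (as in the text) only the scores are computed, not the traceback.

```python
def dp_scores_1d(p, q, eps):
    """D[i][j] = length of the longest admissible chain ending with the pair (p[i], q[j]).

    p and q must be strictly increasing lists of 1-D coordinates.
    """
    m, n = len(p), len(q)
    lower, upper = 1.0 / (1.0 + eps), 1.0 + eps
    D = [[1] * n for _ in range(m)]  # a single pair is always a valid chain
    for i in range(1, m):
        for j in range(1, n):
            best = 0
            for k in range(i):
                for h in range(j):
                    ratio = (q[j] - q[h]) / (p[i] - p[k])
                    if lower < ratio < upper and D[k][h] > best:
                        best = D[k][h]
            D[i][j] = best + 1
    return D

def max_matching_size_1d(p, q, eps):
    """Size of a maximum matching: the largest score in the table."""
    if not p or not q:
        return 0
    D = dp_scores_1d(sorted(p), sorted(q), eps)
    return max(max(row) for row in D)
```

Only consecutive pairs of a chain are compared; by Proposition 1 this is enough to guarantee the ratio bound for every pair of matched points in the chain.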



2.3 NP-Hardness Result for 2-D Case 

In this subsection, we prove the following theorem. 

Theorem 1. Point matching under non-uniform distortion is NP-hard in $d$
dimensions where $d \geq 2$.

Fig. 1. Example of a grid embedding of a planar graph for a 3SAT instance
{{a,b,c},{c,d,e},{b,c,e}}



Proof. Since the details are complicated, we only show a sketch of the proof. We use
a polynomial time reduction from PLANAR 3SAT. Let $C = \{c_1, c_2, \ldots, c_M\}$
be an instance of PLANAR 3SAT over the set of variables $V = \{v_1, v_2, \ldots, v_k\}$,
where we assume that each clause $c_i$ consists of 3 literals. Note that, in PLANAR
3SAT, the graph $G(V \cup C, E)$ must be planar (see Fig. 1), where
$E = \{\{v_i, c_j\} \mid v_i \in c_j \text{ or } \bar{v}_i \in c_j\} \cup \{\{v_j, v_{j+1}\}\} \cup \{\{v_1, v_k\}\}$.

From this instance, we construct an instance (P, Q, e) of the point matching 
problem. The construction will be made up of several components, which can 
be partitioned into three parts, grouped according to their intended function: 
“truth-setting” components, “satisfaction-testing” components, and “routing” 
components. 

First we describe “satisfaction-testing” components (see also Fig. 2) since 
these are the core parts of the construction. For each clause Ci, we construct a 
set of points Ti = qf , qf , qf , qf , where 



P? = (0,0), 

p1 = (o,l), 
q\ = (0,-aL), 



„2 _ ( -/3L _L\ 

yi \ 2 ’ 2 

^2 / y/3aL olL \ 

Hi ~ \ 2 ^ 2 



„3 _ ( VSL _L\ 

Pi ~ \ 2 ’ 2 / ’ 

^3 / otL \ 

Hi — \ 2 ’ 2 



q“ = (0,(l-a)L), qf = ^ 

q]f = (0, (1 + a)L), qf = (_ q.^ = 

and each Ti is to be translated to an appropriate position. 
Fig. 2. “Satisfaction-testing” component

Here we let $\epsilon = 2\alpha$. Then we can see that the following relations hold for
small $\alpha$ ($\alpha < 0.22$):



From these, the following must be satisfied (in order to make $|M| = |P|$):



Here we assume that $c_i = \{v_1, v_2, v_3\}$. Then, case (a) corresponds to a case
where $v_1$ is satisfied, case (b) corresponds to a case where $v_2$ is satisfied, and
case (c) corresponds to a case where $v_3$ is satisfied.

Next we describe “truth-setting” components. For each variable $v_i$, points as
shown in Fig. 3(a) are constructed, where the points are partitioned into three
subsets: $P_i$, $Q_i^t$, $Q_i^f$. It is easy to define coordinates so that if at least one
point in $P_i$ corresponds to some point in $Q_i^t$ (resp. $Q_i^f$), then $P_i$ must
correspond to $Q_i^t$ (resp. $Q_i^f$).

Next we describe “routing” components. Here we assume that a grid embedding
of $G(V \cup C, E)$ is already obtained as shown in Fig. 1. Note that a grid
embedding of size $O(N) \times O(N)$ can be computed in linear time from
$G(V \cup C, E)$. According to this embedding, we connect “truth-setting” components
to “satisfaction-testing” components. For this purpose, the following constructions
are important: (i) copying a truth assignment; (ii) inverting a truth assignment on $v_i$
(i.e., creating $\bar{v}_i$); (iii) connecting a truth assignment to a “satisfaction-testing”
component. (i) can be done as in Fig. 3(b) and (ii) can be done in a similar way.
(iii) can be done as in Fig. 4, where each bold line denotes a sequence of points.

Fig. 3. (a) “Truth-setting” component and (b) “routing” component for copying
a truth assignment

Fig. 4. Connection from “truth-setting” components to a “satisfaction-testing”
component

Then, it can be proved for an appropriate value of $\epsilon$ that there exists a
maximum matching $M$ satisfying $|M| = |P|$ if and only if there exists a truth
assignment satisfying all clauses in $C$.

Although we omit details, the total number of created points is polynomially 
bounded and thus the construction can be done in polynomial time. □ 



3 A Practical Algorithm for 2-D Gel Image Data 

Although the DP algorithm in Section 2.2 is valid only for the 1-D case, the idea can
still be used as a heuristic for matching of 2-D points. In this section, we describe
a pattern matching method based on such a heuristic algorithm. The method
is implemented in the α-version of the DDGEL system
(http://bonsai.ims.u-tokyo.ac.jp/cgi-bin/ddtop/cgi-bin/index.cgi), where DDGEL is an
image analysis system for 2-D gel electrophoresis images obtained from DNA by means
of the RLGS (Restriction Landmark Genomic Scanning) method [5].

The matching method consists of two major steps: finding an initial matching 
and finding the final matching. In the first step, we find a rough matching between 
two point sets. In the second step, we first transform P according to the result 
of the initial matching and then we compute and refine matchings. The heuristic 
algorithm mentioned above is used in the first step. The first step corresponds
to the matching by landmarks used in several practical systems [2]. But, in our
case, we do not need landmarks.

Note that, in analysis of 2-D gel images, the spot detection step is also required 
in order to extract spots from the original gel images, where we use standard 
image processing techniques for this step. Since spot detection is out of the scope 
of this paper, we do not describe the details of spot detection. 

In the following, for a point $p$, $(p)_x$ (resp. $(p)_y$) denotes the $x$-coordinate
(resp. $y$-coordinate) of $p$. If $(p)_x > (q)_x$ and $(p)_y > (q)_y$ hold, we write
$p \succ q$.

3.1 Finding an Initial Matching

In point matching of 2-D gel images, we consider the $L_1$ distance

\[ d_1(p, q) = |(p)_x - (q)_x| + |(p)_y - (q)_y| \]

instead of the $L_2$ distance used in Section 2, because a 2-D gel image is usually
obtained by using two enzymes: one for the direction of the X-axis and the other
for the direction of the Y-axis. Although we have not yet proved an NP-hardness
result for this case, it seems to remain NP-hard.

For this case, we have the following proposition: 



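Proposition 2. If both

\[ \frac{1}{1+\epsilon} < \frac{d_1(q_{j_2}, q_{j_1})}{d_1(p_{i_2}, p_{i_1})} < 1 + \epsilon \quad \text{and} \quad \frac{1}{1+\epsilon} < \frac{d_1(q_{j_3}, q_{j_2})}{d_1(p_{i_3}, p_{i_2})} < 1 + \epsilon \]

hold where $p_{i_3} \succ p_{i_2} \succ p_{i_1}$ and $q_{j_3} \succ q_{j_2} \succ q_{j_1}$, then

\[ \frac{1}{1+\epsilon} < \frac{d_1(q_{j_3}, q_{j_1})}{d_1(p_{i_3}, p_{i_1})} < 1 + \epsilon \]

holds.

Based on this, we can obtain a longest sequence
$((p_{i_1}, q_{j_1}), \ldots, (p_{i_l}, q_{j_l}))$ satisfying
$(\forall h \neq k)\left(\frac{1}{1+\epsilon} < \frac{d_1(q_{j_h}, q_{j_k})}{d_1(p_{i_h}, p_{i_k})} < 1 + \epsilon\right)$,
$(\forall h)(p_{i_{h+1}} \succ p_{i_h})$ and $(\forall h)(q_{j_{h+1}} \succ q_{j_h})$,
by means of a simple DP algorithm similar to that in Section 2.2.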

However, in this case, we obtain a matching only for points in a diagonal-like 
region. Such a matching is not sufficient for an initial matching. Therefore, we 
use the following DP algorithm for finding an initial matching. In this case, there 
is no longer any theoretical guarantee for the obtained matchings. However, it 
worked well when the number of insertions and deletions of points was not large 
and the distortion was not very large. 

    for i = 1 to m do
        for j = 1 to n do
        begin
            score ← 0; maxscore ← 0; count ← 0;
            for each p_k ∈ neighbors(p_i) do
                for each q_h ∈ neighbors(q_j) do
                    if d_1(p_k − p_i, q_h − q_j) < α_1 · |p_k − p_i| then
                    begin
                        if S[k][h] > maxscore then maxscore ← S[k][h];
                        count ← count + 1; skip to next p_k;
                    end;
            S[i][j] ← maxscore + 2 · count + 1;
        end



In the above, neighbors(p) denotes the set of the $K_1$ nearest points $p'$ of $p$ such
that $p \succ p'$, where $K_1$ is a constant (we use $K_1 = 6$ in the current
implementation). $\alpha_1$ is also a constant depending on the size and the density of
the input point sets, where we use $\alpha_1 = 0.4$ in the current implementation. In order
to find matching pairs in non-diagonal regions, we use maxscore + 2 · count + 1 as
a score, where there is no concrete reason for using this value. The procedure
works in $O(mn)$ time for a constant $K_1$. Note also that only the procedure for
computing scores of point pairs is described in the above. A set of point pairs
(i.e., an initial matching) is obtained from the scores by a traceback-like method,
where we omit details here.
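
The heuristic score computation can be sketched in Python as follows. Several details are assumptions of this sketch and are not stated in the text: neighbors are ranked by $L_1$ distance, $|p_k - p_i|$ is taken to be the Euclidean length, and points of $P$ are processed in increasing order of $x + y$ so that the scores of all dominated neighbors are available when needed. The function names are hypothetical.

```python
import math

def l1(a, b):
    """L1 distance between two 2-D points or displacement vectors."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def dominated_neighbors(points, k1):
    """For each point p, indices of the up-to-k1 nearest points p' with p > p' in both coordinates."""
    result = []
    for p in points:
        cand = [j for j, r in enumerate(points) if p[0] > r[0] and p[1] > r[1]]
        cand.sort(key=lambda j: l1(p, points[j]))  # assumption: rank by L1 distance
        result.append(cand[:k1])
    return result

def initial_scores(P, Q, k1=6, alpha1=0.4):
    """Score table S[i][j] = maxscore + 2 * count + 1, as in the pseudocode above."""
    nP, nQ = dominated_neighbors(P, k1), dominated_neighbors(Q, k1)
    S = [[0] * len(Q) for _ in P]
    # Dominated points have a smaller x + y, so they are scored first.
    for i in sorted(range(len(P)), key=lambda t: P[t][0] + P[t][1]):
        for j in range(len(Q)):
            maxscore, count = 0, 0
            for k in nP[i]:
                dp = (P[k][0] - P[i][0], P[k][1] - P[i][1])
                for h in nQ[j]:
                    dq = (Q[h][0] - Q[j][0], Q[h][1] - Q[j][1])
                    if l1(dp, dq) < alpha1 * math.hypot(dp[0], dp[1]):
                        maxscore = max(maxscore, S[k][h])
                        count += 1
                        break  # "skip to next p_k"
            S[i][j] = maxscore + 2 * count + 1
    return S
```

With the constant $K_1$ fixed, the two inner loops run a bounded number of times, which gives the $O(mn)$ bound for the score table itself; the naive neighbor search in this sketch is quadratic and would be replaced by a spatial index in practice.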

3.2 Finding the Final Matching 

In the second step, we compute the final matching based on the initial matching 
found in the first step. As in the first step, there is no theoretical guarantee for 
the obtained matchings. Since several heuristics are used in the second step, we 
only describe an outline of the procedure. We say that $p$ is locally similar to $q$
if the following condition is satisfied:

\[ |\{ p_i \mid p_i \in neighbors(p) \text{ and } (\exists q_j \in neighbors(q))\,( d_1(p_i - p, q_j - q) < \alpha_2 \cdot |p_i - p| ) \}| > K_2, \]

where $\alpha_2$ and $K_2$ are constants depending on the size and the density of the input
point sets. We use $\alpha_2 = 0.25$ and $K_2 = 4$ in the current implementation. In this
case, neighbors(p) is the set of the 10 nearest neighbors of $p$.

(1) From the set of pairs $\{(p_{i_1}, q_{j_1}), \ldots, (p_{i_l}, q_{j_l})\}$ found in the
first step, compute the affine transformation $TR : (x, y) \to (ax + b, cy + d)$ such
that $\sum_k |TR(p_{i_k}) - q_{j_k}|^2$ is minimized, by means of the least squares fitting
method. Then, we apply $TR$ to $P$.

(2) Apply the DP algorithm to $P$ and $Q$ again and execute (1) again.

(3) For each point $p \in P$, find a corresponding point $q \in Q$ (if one exists)
such that $|p - q| < D$ and $p$ is locally similar to $q$ (currently we use $D = 20$).

(4) Apply the local transformation to each point $p \in P$, where the local
transformation is computed from neighbors(p) and neighbors(q) by means of the
least squares fitting method.

(5) Repeat (3) and (4) several times.
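
Step (1) can be illustrated with a short Python sketch. Because the transformation $(x, y) \to (ax + b, cy + d)$ treats the two axes independently, the least squares problem splits into two ordinary one-variable linear regressions. The function names are hypothetical, and the sketch assumes that the matched points of $P$ do not all share the same $x$- or $y$-coordinate.

```python
def fit_line(u, v):
    """Least squares (a, b) minimizing sum_k (a * u_k + b - v_k)^2."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    var = sum((x - mean_u) ** 2 for x in u)
    cov = sum((x - mean_u) * (y - mean_v) for x, y in zip(u, v))
    a = cov / var  # assumes the u values are not all equal
    return a, mean_v - a * mean_u

def fit_axis_affine(pairs):
    """Fit TR(x, y) = (a*x + b, c*y + d) to a list of matched pairs (p, q)."""
    a, b = fit_line([p[0] for p, _ in pairs], [q[0] for _, q in pairs])
    c, d = fit_line([p[1] for p, _ in pairs], [q[1] for _, q in pairs])
    return a, b, c, d

def apply_transform(params, points):
    """Apply the fitted transformation to every point of P."""
    a, b, c, d = params
    return [(a * x + b, c * y + d) for x, y in points]
```

The same fitting routine could be reused for the local transformation in step (4) by restricting the pairs to the matched neighbors of a point, although the text does not specify the exact form of that local transformation.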

3.3 Examples 

Here, we show examples where the above method was applied to point sets
obtained from real RLGS image data. Although we applied the method to several
data sets, we only show two typical examples here. The method is implemented in C on
a SUN Ultra-2 workstation with a 300 MHz CPU and 640 Mbytes of main memory.

The first example (see Fig. 5) is an easy case: the number of insertions and 
deletions of points is small and the distortion is small. In this case, 1052 pairs of 
matching points were found where P consisted of 1128 points and Q consisted 
of 1148 points. It took 32.7 sec. in total. 

The second example (see Fig. 6) is a rather difficult case: either the number 
of insertions and deletions or the distortion is not small. 

Fig. 5. Example 1. The point set on the left hand side (P) consists of 1128 points
and the point set on the right hand side (Q) consists of 1148 points. In this case, 
1052 matching pairs were found. 




Fig. 6. Example 2. The point set on the left hand side (P) consists of 1363 points 
and the point set on the right hand side (Q) consists of 1682 points. In this case, 
824 matching pairs were found (see also Fig. 7). 

Fig. 7. Result of point matching in Example 2, where one point set (P) is
transformed by a non-uniform transformation.



A good matching was still found in this case: 824 pairs of matching points 
were found where P consisted of 1363 points and Q consisted of 1682 points. 
It took 57.7 sec. in total. The result of the matching is shown in Fig. 7, where 
P is transformed by a non-uniform transformation generated by the matching 
program. 

Abstract. In dendrochronology wood samples are dated according to
the tree rings they contain. The dating process consists of comparing the 
sequence of tree ring widths in the sample to a dated master sequence. 
Assuming that a tree forms exactly one ring per year a simple sliding 
algorithm solves this matching task. 

But sometimes a tree produces no ring or even two rings in a year.
If a sample sequence contains inconsistencies of this kind, it cannot be
dated correctly by the simple sliding algorithm. We therefore introduce
an $O(\alpha^2 mn + \alpha^4 (m + n))$ algorithm for dating such a sample sequence
against an error-free master sequence, where n and m are the lengths of
the sequences. Our algorithm takes into account that the sample might
contain up to $\alpha$ missing or double rings and suggests possible positions for
inconsistencies of this kind. This is done by employing an edit distance
as the distance measure.



1 Introduction 

1.1 Dendrochronology 

The tree ring structure in wood samples is important in many research areas, for 
instance in archaeology, climatology, geomorphology and glaciology. The reason 
for that is that the growth of a tree and therefore its rings depend on the envi- 
ronmental conditions that the tree has been exposed to, so that the tree rings 
build an archive of these environmental conditions. The science that deals with 
the dating of tree rings in order to answer questions related to natural history 
is called dendrochronology. The name is derived from the Greek words dendron
(wood), chronos (time) and logos (the science of).

A tree ring is a growth layer that the tree forms under its bark during the 
vegetation period. It consists of big, thin-walled cells that are built at the
beginning of the growth period and of thin, thick-walled cells built at the end. The
first type of cell ensures the food supply to the shoots, whereas the other type 
accounts for the stability of the stem. Since the second type of cell looks much 

* Part of a research project supported by Deutsche Forschungsgemeinschaft, grant AL 
253/4-2 

M. Crochemore, M. Paterson (Eds.): CPM’99, LNCS 1645, pp. 223- 0^^ 1999. 

(c) Springer- Verlag Berlin Heidelberg 1999 
darker than the first type, it is possible to visually detect the border between
two successive tree rings. In areas with an annual vegetation and winter period 
a tree usually adds exactly one tree ring per year. 

In dendrochronology a wood sample is characterized by the sequence of its 
tree ring widths¹. Since trees growing under similar conditions (especially climatic
conditions like rainfall) build similar tree rings, it is possible to successfully
compare certain tree ring sequences. In fact, the usual way of dating tree
ring sequences in dendrochronology is to compare the undated sequence to a 
dated sequence. This procedure, called crossdating, is a fundamental task in 
dendrochronology. 



1.2 Crossdating 

Assuming that the trees being considered have built exactly one ring each year, 
a crossdating can be performed by sliding the sample along the master sequence 
starting and ending with a certain constant minimum overlap of e.g. 50 rings. At 
each position the distance (according to a predefined distance measure) between 
the overlapping parts of the sequences is computed and the position yielding 
the best distance is proposed as the correct dating position. The most common
distance measures are the t-value and the so-called Gleichläufigkeitskoeffizient
(percentage of slope equivalence).

Let $x = x_0, \ldots, x_{N-1}$ and $y = y_0, \ldots, y_{N-1}$ be the two sequences being
compared in one step of the algorithm. Then the t-value (Student's t) is defined
by

\[ t = r \sqrt{\frac{N-2}{1-r^2}}, \qquad (1) \]

where $r$ is the correlation coefficient

\[ r = \frac{\sum_{i=0}^{N-1} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=0}^{N-1} (x_i - \bar{x})^2 \sum_{i=0}^{N-1} (y_i - \bar{y})^2}} \qquad (2) \]

\[ \phantom{r} = \frac{N \sum_{i=0}^{N-1} x_i y_i - \sum_{i=0}^{N-1} x_i \sum_{i=0}^{N-1} y_i}{\sqrt{\left(N \sum_{i=0}^{N-1} x_i^2 - \left(\sum_{i=0}^{N-1} x_i\right)^2\right)\left(N \sum_{i=0}^{N-1} y_i^2 - \left(\sum_{i=0}^{N-1} y_i\right)^2\right)}} \qquad (3) \]



¹ Depending on the application, tree ring characteristics other than the width are
also used, but in this article we will regard tree ring widths only.
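with the arithmetic means $\bar{x}$ and $\bar{y}$. The Gleichläufigkeitskoeffizient $Glk$
is the percentage of slope equivalence of the two sequences,

\[ Glk = \frac{1}{N-1} \sum_{i=0}^{N-2} \chi\big(\mathrm{sgn}(x_{i+1} - x_i) = \mathrm{sgn}(y_{i+1} - y_i)\big) \]

\[ \phantom{Glk} = \frac{1}{N-1} \sum_{k=-1}^{1} \sum_{i=0}^{N-2} \chi\big(\mathrm{sgn}(x_{i+1} - x_i) = k\big)\, \chi\big(\mathrm{sgn}(y_{i+1} - y_i) = k\big), \]

with the characteristic function $\chi(a = b) = 1$ if $a = b$ and $0$ otherwise.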

A sequential computation of all distances takes $\Theta(nm)$ time, where $n$ and $m$
are the lengths of the master and the sample sequences, respectively. Considering
(3), a non-sequential computation of all correlation coefficients depends on the
efficient calculation of the correlation terms $\sum_i x_i y_i$, since all other terms can
be computed in linear time. Note that the inner sum in the second expression for
$Glk$ is also a correlation term. Employing the Fast Fourier Transform (FFT), all such
correlation terms can be computed in time $\Theta((n + m)\log(n + m))$ instead of the
brute-force $\Theta(nm)$. Due to the discretization of the slope of the sequences, the
Gleichläufigkeitskoeffizient usually gives less information than the t-value.
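
The brute-force $\Theta(nm)$ sliding procedure with the t-value can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the sample is only placed fully inside the master (the partial overlaps with a minimum overlap described above are omitted), the sequences are assumed to be already standardized, and the function names are hypothetical.

```python
import math

def t_value(x, y):
    """Student's t for two equally long sequences, using form (3) of r."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    denom = math.sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))
    if denom == 0.0:
        return 0.0  # a constant sequence carries no correlation information
    r = (n * sxy - sx * sy) / denom
    if abs(r) >= 1.0:
        return math.copysign(math.inf, r)  # perfect (anti-)correlation
    return r * math.sqrt((n - 2) / (1.0 - r * r))

def crossdate(master, sample):
    """Return (offset, t) of the position with the highest t-value."""
    m = len(sample)
    best_offset, best_t = None, -math.inf
    for offset in range(len(master) - m + 1):
        t = t_value(sample, master[offset:offset + m])
        if t > best_t:
            best_offset, best_t = offset, t
    return best_offset, best_t
```

As the text notes, the proposed position is only a suggestion: with noisy data the best-scoring offset still has to be checked visually.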

Since the data is very noisy it usually does not suffice to simply date the 
sample sequence according to the crossing position which yields the best distance.
The results of the matching algorithm are always visually checked by a
dendrochronologist.

Before the above described comparison can be made, the tree ring sequences 
have to be filtered. This so-called standardization process cleans the data from 
individual trends, which usually are long term trends. Thus only general trends 
which occur in several tree ring sequences remain. Typically high-pass filters 
like the percentage of a five-year running mean or the logarithmic difference are 
used. 



1.3 Missing and Double Rings

However, the assumption made above that a tree ring sequence contains exactly 
one value per year is not always true. First of all mistakes during the measure- 
ment of the ring widths happen, especially if the rings are very thin. Moreover, 
due to bad growing conditions a tree might also not build a ring around the 
whole stem or even not at all, which can result in a missing ring in the tree ring 
sequence. Climatic changes can also cause a tree to build two rings a year, a 
double ring. 

If the sequences to be compared contain missing or double rings, most matching
algorithms do not produce satisfactory results, since they do not take into
account the transposition in time which is caused by a missing or a double ring.
The usual approach to dating a sample sequence which may contain inconsistencies
against a clean master sequence is to split the sample into shorter parts and
to date each part on its own (either manually or using Cofecha [?]). Finally, a
possible position for a ring insertion or merging is deduced manually. Cofecha
[?] is a quality control tool which checks a set of dated samples for mutual
dating consistency by splitting each sequence into small pieces and comparing
these to the other sequences. This produces a lot of information to be
evaluated. The information needed to deduce a possible missing ring (i.e. when
the pieces to the right of the missing ring position all date to one year later)
is then available, but a missing ring is not explicitly proposed.



2 Edit Distances in an α-Box
2.1 A Simple Edit Distance

Let A = a_0, ..., a_{n−1} and B = b_0, ..., b_{m−1} be two standardized tree
ring width sequences, where A may contain missing or double rings (representing
the sample sequence) whereas B is known to be a clean reference sequence
(representing a part of the master sequence). In order to get a notion of how
much A differs from B, we look for a transformation which turns A into a
sequence close to B by inserting rings (which compensates for a missing ring)
or merging two rings into one (which compensates for a double ring). Closeness
is defined by taking the sum of the squared differences. The transformations
allowed are described by transformation sequences over the alphabet {I, M, N},
where I stands for insert, M for merge and N for the identity operation.

Let for example A = 3, 2, 1, 2, 5, 3 and consider the transformation sequence
τ = MMNIIN, see Fig. 1. The transformation is performed sequentially from
left to right by merging 3 and 2 to 5 (M), merging 1 and 2 to 3 (M), not changing
5 (N), inserting a ring, which is done by taking the average μ = (5 + 3)/2 = 4
of the two surrounding rings (I), inserting another ring in the same manner (I),
and finally not changing the last ring 3 (N).

    τ       M      M     N   I   I   N
    A      3 2    1 2    5           3
    τ(A)    5      3     5   4   4   3

Fig. 1. An example of a sequence A, a transformation sequence τ and the
transformed sequence τ(A).
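The example of Fig. 1 can be reproduced with a few lines of code. This is a minimal sketch with a hypothetical function name; the rule used when an insertion has only one neighbouring ring (at either end of the sequence) is an assumption, since the text only covers the case of two surrounding rings.

```python
def apply_transformation(a, ops):
    """Apply a transformation sequence over {I, M, N} to the ring widths a."""
    out = []
    i = 0  # number of rings of a consumed so far
    for op in ops:
        if op == "M":      # merge two rings into one
            out.append(a[i] + a[i + 1])
            i += 2
        elif op == "N":    # keep the ring unchanged
            out.append(a[i])
            i += 1
        elif op == "I":    # insert the average of the surrounding rings
            # Assumed boundary rule: with a single neighbour, copy it.
            neighbours = a[max(i - 1, 0):i + 1]
            out.append(sum(neighbours) / len(neighbours))
        else:
            raise ValueError("unknown operation: " + op)
    if i != len(a):
        raise ValueError("transformation does not consume the whole sequence")
    return out

print(apply_transformation([3, 2, 1, 2, 5, 3], "MMNIIN"))
# [5, 3, 5, 4.0, 4.0, 3]
```

Each operation produces exactly one ring of the transformed sequence, so the length of the transformation sequence equals the length of the result.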



The simple edit distance D_simp(A, B) is defined by

\[
D_{simp}(A, B) =
\begin{cases}
\min_{\tau \in T_{n,m}} \sum_{\nu=0}^{m-1} \left( \tau(A)_\nu - b_\nu \right)^2
  & \text{if } T_{n,m} \neq \emptyset \\
\text{not defined} & \text{otherwise}
\end{cases}
\qquad (6)
\]

where T_{n,m} contains all transformation sequences (which we identify with a
transformation each) that transform a sequence of length n into a sequence of
length m. We call a transformation sequence which minimizes the sum optimal.
For a transformation τ ∈ T_{n,m} let γ_τ be the number of merge operations, ι_τ
the number of insert operations and ν_τ the number of identity operations in τ.
Then from the definition of τ it follows that n = 2γ_τ + ν_τ as well as
m = γ_τ + ι_τ + ν_τ = n + ι_τ − γ_τ. Making use of these properties it is easy
to show that T_{n,m} is non-empty if and only if n ≤ 2m. This notion of an edit
distance is based on the edit distance for strings (see [?], [?]) or on dynamic
time warping in speech recognition (see [?]), respectively.
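The counting argument can be checked by brute force for small lengths. The sketch below uses a hypothetical helper name. It enumerates all transformation sequences that produce m rings and collects the source lengths n = 2·(number of merges) + (number of identities) they can start from.

```python
from itertools import product

def reachable_lengths(m):
    """Source lengths n for which some transformation sequence yields length m."""
    lengths = set()
    for ops in product("IMN", repeat=m):  # one output ring per operation
        merges, identities = ops.count("M"), ops.count("N")
        lengths.add(2 * merges + identities)
    return lengths

for m in range(5):
    print(m, sorted(reachable_lengths(m)))  # exactly the lengths 0, ..., 2m
```

Inserts alone yield n = 0 and merges alone yield n = 2m, which are the two extremes of the condition n ≤ 2m.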

The benefit of taking the sum of the squared differences as a minimization
criterion is the existence of a recurrence which leads to an efficient
computation of the simple edit distance. Define
D_simp(i, j) := D_simp(A[0..i − 1], B[0..j − 1]) to be the simple edit distance
of the prefixes of A and B for i ≤ 2j. Then the definition of the
transformations by transformation sequences implies the existence of the
following recurrence:

\[
D_{simp}(0, 0) = 0
\]
\[
D_{simp}(i, j) = \min
\begin{cases}
D_{simp}(i-2, j-1) + (a_{i-2} + a_{i-1} - b_{j-1})^2 \\
D_{simp}(i-1, j-1) + (a_{i-1} - b_{j-1})^2 \\
D_{simp}(i, j-1) + (\mu(i-1) - b_{j-1})^2
\end{cases}
\qquad (7)
\]

for all 0 ≤ i ≤ n, 0 ≤ j ≤ m with i ≤ 2j.





Fig. 2. Matrix for the dynamic programming computation of the simple edit
distance D_simp(A, B). The left cut-off is caused by the condition i ≤ 2j, the
right by (n − i) ≤ 2(m − j).



Although the transformation space is exponentially large, a dynamic programming
approach allows D_simp(A, B) to be computed in Θ(nm) time and space. This
is accomplished by sequentially filling an (n + 1) × (m + 1) matrix (see Fig. 2)
in which cell (i, j) contains the value D_simp(i, j). The condition i ≤ 2j and
the symmetrical condition (n − i) ≤ 2(m − j) cut two corners off the matrix,
which represent undefined values and values unnecessary for the computation of
D_simp(n, m), respectively. According to (7) a value is computed out of at most
three values (see Fig. 3). The value D_simp(A, B) = D_simp(n, m) is placed in
cell (n, m).





Fig. 3. Dynamic programming computation of the simple edit distance according
to (7).



An optimal transformation can be retrieved from the filled matrix by
backtracking the performed computation. This is done by starting in cell (n, m)
and recursively checking which of the three possible cells contributed its
value to the examined cell (either by recalculating the sums or by checking a
previously saved pointer/arrow to a cell). In this way a path of cells (or
arrows, see Fig. [?]) from cell (n, m) to cell (0, 0) is constructed, which
corresponds to a transformation sequence.
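The matrix filling and the backtracking can be sketched together as follows. This is an illustrative sketch, not the author's C++ program. The function name is hypothetical, μ is assumed to be the average of the two rings around the insertion point (with a single neighbour copied at the ends), and only the left cut-off i ≤ 2j is applied, which does not change the result.

```python
import math

def simple_edit_distance(a, b):
    """Return (D_simp(A, B), optimal transformation), or (None, None) if undefined."""
    n, m = len(a), len(b)
    dist = [[math.inf] * (m + 1) for _ in range(n + 1)]
    move = [[None] * (m + 1) for _ in range(n + 1)]
    dist[0][0] = 0.0

    def mu(i):
        # Assumed: average of the rings around the gap after i consumed
        # rings; this corresponds to mu(i - 1) in the recurrence.
        neighbours = a[max(i - 1, 0):i + 1]
        return sum(neighbours) / len(neighbours)

    for j in range(1, m + 1):
        for i in range(0, min(n, 2 * j) + 1):  # left cut-off i <= 2j
            candidates = []
            if i >= 2:
                cost = dist[i - 2][j - 1] + (a[i - 2] + a[i - 1] - b[j - 1]) ** 2
                candidates.append((cost, "M"))
            if i >= 1:
                cost = dist[i - 1][j - 1] + (a[i - 1] - b[j - 1]) ** 2
                candidates.append((cost, "N"))
            if n > 0:
                cost = dist[i][j - 1] + (mu(i) - b[j - 1]) ** 2
                candidates.append((cost, "I"))
            if candidates:
                dist[i][j], move[i][j] = min(candidates)

    if math.isinf(dist[n][m]):
        return None, None

    # Backtracking: follow the saved moves from cell (n, m) to cell (0, 0).
    ops = []
    i, j = n, m
    while j > 0:
        op = move[i][j]
        ops.append(op)
        i -= {"M": 2, "N": 1, "I": 0}[op]
        j -= 1
    return dist[n][m], "".join(reversed(ops))

print(simple_edit_distance([3, 2, 1, 2, 5, 3], [5, 3, 5, 4, 4, 3]))
# (0.0, 'MMNIIN')
```

Saving one move per cell trades Θ(nm) extra space for a traceback that does not have to recompute any sums.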



2.2 Van Deusen’s Edit Distance 

The transformation space over which the simple edit distance is minimized
includes in particular transformations containing many edit operations. Such
transformations correspond to paths in the computation matrix with many
non-diagonal arrows. Since a tree ring sequence usually contains only very few
missing or double rings, Van Deusen [?] reduced the transformation space by
allowing only those paths in the matrix which stay inside a given strip of
constant width around the diagonal starting at (0, 0); see Fig. 4.

The width of a strip is given by a parameter α which denotes the width on
each side of the diagonal. The transformation space reduced in this way contains
only transformations in which the edit operations are locally balanced. Yet
transformations with many edit operations are still possible. For instance, if
the transformation sequence alternates between a merge and an insert operation,
the corresponding transformation path still stays inside a strip of width 1
around the diagonal.

Fig. 4. Van Deusen’s computation matrix with a strip of width α = 2 to each
side of the diagonal. The diagonal has been shaded.
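The weakness of the strip restriction can be made concrete with a short sketch. It assumes that the distance from the diagonal is measured as |i − j| in matrix cells, where i counts consumed sample rings and j counts produced rings; the function name is hypothetical.

```python
def max_diagonal_deviation(ops):
    """Largest distance |i - j| from the main diagonal along a transformation path."""
    i = j = 0
    worst = 0
    for op in ops:
        i += {"M": 2, "N": 1, "I": 0}[op]  # rings of the sample consumed
        j += 1                             # every operation outputs one ring
        worst = max(worst, abs(i - j))
    return worst

print(max_diagonal_deviation("MIMIMIMI"))  # 1: eight edit operations, strip width 1
print(max_diagonal_deviation("MMMM"))      # 4: four edit operations, strip width 4
```

A narrow strip therefore bounds the imbalance between merges and inserts, but not the total number of edit operations.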



2.3 α-Box Edit Distance

A straightforward improvement of Van Deusen’s edit distance is the following
notion, which we call α-box edit distance or k-edit distance. This type of edit
distance has been proposed for strings by Sankoff and Kruskal [?]. Since the
number of edit operations contained in an optimal transformation should be
small, the idea is to consider transformations and edit distances depending on
the number of edit operations. We therefore define the k-edit distance as
follows:



\[
D(A, B, k) =
\begin{cases}
\min_{\tau \in T_{n,m,k}} \sum_{\nu=0}^{m-1} \left( \tau(A)_\nu - b_\nu \right)^2
  & \text{if } T_{n,m,k} \neq \emptyset \\
\text{not defined} & \text{otherwise}
\end{cases}
\qquad (8)
\]

where T_{n,m,k} is the set of all transformation sequences transforming a
sequence of length n into a sequence of length m using exactly k edit
operations. Let again γ_τ be the number of merge operations, ι_τ the number of
insert operations and ν_τ the number of identity operations in τ. Then from the
definition of τ it follows that m = γ_τ + ι_τ + ν_τ = n + ι_τ − γ_τ and
ι_τ + γ_τ = k. It is then easy to show that T_{n,m,k} is non-empty if and only
if m ≥ k and m − n = k − 2γ for a γ ∈ {0, ..., k}. We define D(i, j, k) to be
the k-edit distance between the prefixes A[0..i − 1] and B[0..j − 1]. Just as in
the case of the simple edit distance, the k-edit distance satisfies the
following recurrence:



\[
D(0, 0, 0) = 0
\]
\[
D(i, j, k) = \min
\begin{cases}
D(i-2, j-1, k-1) + (a_{i-2} + a_{i-1} - b_{j-1})^2 \\
D(i-1, j-1, k) + (a_{i-1} - b_{j-1})^2 \\
D(i, j-1, k-1) + (\mu(i-1) - b_{j-1})^2
\end{cases}
\qquad (9)
\]

for all 0 ≤ i ≤ n, 0 ≤ j ≤ m with j ≥ k and j − i = k − 2γ for a
γ ∈ {0, ..., k}.

The α-edit distance can be computed in a dynamic programming manner in
O(α² min(n, m)) time and space. The storage required is a part of an
(α + 1) × (n + 1) × (m + 1) box (see Fig. 6) in which cell (i, j, k) contains
the value D(i, j, k). Due to the condition j − i = k − 2γ for a γ ∈ {0, ..., k},
the defined values of D(i, j, k) form diagonals inside the matrix (see Fig. 5
and Fig. 6). For each γ ∈ {0, ..., k} there is one corresponding diagonal in
level k. All edit distances in one diagonal contain the same number of merge
and insert operations, as shown in Fig. 6.



[Figure panels: level k; level k − 1]

Fig. 5. Dynamic programming computation of the k-edit distance according to
(9). A projection of the two levels onto one level results in the matrix shown
in Fig. 3.



According to (9) the value D(i, j, k) is computed out of at most three values
(see Fig. 5), whereby a change of the k-level is performed only in the case of
an edit operation (merge or insert). The computation is carried out by filling
the diagonals level by level, thereby touching each cell only a constant number
of times, until finally filling cell (n, m, α). Note that this computation box
(α-box) contains in particular all k-edit distances between A and B with
0 ≤ k ≤ α. In each k-level there are at most k + 1 diagonals, and each diagonal
contains at most min(n, m) + 1 cells. Therefore there are at most
(min(n, m) + 1) · Σ_{k=0}^{α} (k + 1) = O(α² min(n, m)) cells to be filled.
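A compact way to experiment with the k-edit distance is to store only the defined cells of the box in a dictionary. The sketch below makes the same assumptions as the recurrence, plus an assumed definition of μ as the average of the two rings around the insertion point (with a single neighbour copied at the ends). The function name is hypothetical, and the code is not the author's implementation.

```python
def k_edit_distances(a, b, alpha):
    """Return {k: D(A, B, k)} for all 0 <= k <= alpha where it is defined."""
    n, m = len(a), len(b)
    box = {(0, 0, 0): 0.0}  # only defined cells are stored

    def mu(i):
        # Assumed: average of the rings around the gap after i consumed rings.
        neighbours = a[max(i - 1, 0):i + 1]
        return sum(neighbours) / len(neighbours)

    for j in range(1, m + 1):
        for k in range(0, min(alpha, j) + 1):   # condition j >= k
            for gamma in range(0, k + 1):       # one diagonal per gamma
                i = j - k + 2 * gamma           # from j - i = k - 2*gamma
                if not 0 <= i <= n:
                    continue
                candidates = []
                if i >= 2 and (i - 2, j - 1, k - 1) in box:   # merge
                    candidates.append(box[i - 2, j - 1, k - 1]
                                      + (a[i - 2] + a[i - 1] - b[j - 1]) ** 2)
                if i >= 1 and (i - 1, j - 1, k) in box:       # identity
                    candidates.append(box[i - 1, j - 1, k]
                                      + (a[i - 1] - b[j - 1]) ** 2)
                if n > 0 and (i, j - 1, k - 1) in box:        # insert
                    candidates.append(box[i, j - 1, k - 1]
                                      + (mu(i) - b[j - 1]) ** 2)
                if candidates:
                    box[i, j, k] = min(candidates)
    return {k: box[n, m, k] for k in range(alpha + 1) if (n, m, k) in box}

result = k_edit_distances([3, 2, 1, 2, 5, 3], [5, 3, 5, 4, 4, 3], alpha=4)
print(sorted(result))        # [0, 2, 4]: equal lengths need as many merges as inserts
print(result[0], result[4])  # 26.0 0.0
```

For the sequences of Fig. 1 the distance drops from 26 without edit operations to 0 with four, which is the transformation MMNIIN.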

Often an optimal transformation contains two opposite edit operations
(insert/merge or merge/insert) in close succession. A pair of edit operations
like this has no global effect on the edited sequence, but only a local effect
during the short time interval between the two edit operations. Given a
threshold (e.g. 10) for the minimum number of years between two opposite edit
operations, 10 cells on the diagonal after an edit operation are marked so that
the opposite edit operation is not allowed when calculating the edit distances
for these cells.

Fig. 6. Dynamic programming box needed for the computation of the 3-edit
distance between two sequences of length 4 and 7. The defined cells have been
shaded. The number of merge (M) and insert (I) operations corresponding to
the diagonals have been written next to each diagonal.

Although theoretically there can be several optimal transformations associated
with one edit distance (this corresponds to more than one arrow leaving one
cell), we always choose exactly one optimal transformation per edit distance
(or cell), since due to the real-valued input data an exact equality of the
sums is unlikely. Then, instead of storing in each cell a pointer to the cell
which contributed its value to the sum, we collapse a path of diagonal pointers
into one pointer (shortcut) pointing directly to the position where the
transformation path changes the k-level. That way a traceback of the
transformation path of a cell in level k needs O(k) time, and a transformation
can be saved for later use in Θ(k) space.

3 Crossdating Employing k-Edit Distances

Crossdating is usually performed by sliding the sample sequence across the
master sequence and computing distances between the overlapping parts at each
crossing position. The same is done in the algorithm presented here, using the
k-edit distance and therefore taking into account possible missing or double
rings. The parameter α must be specified by the user in advance. As a
postprocessing step a new heuristic is applied in order to further restrict the
number of edit operations contained in a transformation sequence.



3.1 Simple Crossdating by Sample Sliding 

Let us take a closer look at the α-box which has to be filled in order to compute
the α-edit distance between the sample and a specific (contiguous) piece of the
master sequence. Assume the master piece starts at year y. Then the last row of
each k-level, 0 ≤ k ≤ α, contains all k-edit distances between the sample and
all prefixes of the master piece, while each last column contains all k-edit
distances between the master piece and all prefixes of the sample. If the master
piece is a suffix of the master sequence (or if its length is the length of the
sample plus α), the α-box contains all possible k-edit distances between the
sample and those master pieces which date the sample to year y.

When we slide the sample across the master, at each position computing an
α-box for the sample and the suffix of the master sequence, in particular the
last row and last column of each level, we obtain all the information we need
to date the sample: The last rows contain results comparing the sample to the
master sequence at different offsets, where each offset is represented in one
α-box. The last columns are of interest only if the sample partly overlaps the
end of the master sequence, so that a prefix of the sample must be considered.
In the case where the master piece is longer than the sample, the last column
degenerates to one cell which is already part of the last row.

After having computed all edit distances in the last rows and last columns
of each level in each α-box, we need to compare them in order to find an
acceptable dating. A simple comparison is not meaningful, since the number of
terms adding up to a k-edit distance in (8) varies according to the number and
kind of edit operations performed and also according to a partial overlap of the
sequences. So we need to normalize the edit distance by dividing by the number
of added terms, which is the length of the transformed sample sequence. However,
a simple comparison of all normalized edit distances (which means sorting
them and taking the smallest as the best) proved not to be useful. The reason
is that the normalization removes the information about the length of the
sequence (i.e. the number of summands), so that shorter sequences yield a good
edit distance more easily than longer sequences do.

Since the t-value is length-dependent and a commonly used distance measure
in dendrochronology (one which dendrochronologists are familiar with), we take
the t-value between the transformed sequence and the master piece as the judging
criterion. The correlation coefficient can be calculated implicitly during the
box filling process at no extra asymptotic cost. So we sort all t-values of the
last row and last column of each box and output the largest t-values to the
user, each together with an optimal transformation (i.e. the positions of
possible missing and double rings) and the corresponding dating proposal
(offset).
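Why a length-dependent measure helps can be seen in a small sketch. It assumes the t-value is the usual Student's t statistic of the correlation coefficient, t = r·sqrt(n − 2)/sqrt(1 − r²); the paper's own definition lies outside this passage, and the function names are hypothetical.

```python
import math

def correlation(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((p - mx) * (q - my) for p, q in zip(x, y))
    sxx = sum((p - mx) ** 2 for p in x)
    syy = sum((q - my) ** 2 for q in y)
    return sxy / math.sqrt(sxx * syy)

def t_value(x, y):
    """Assumed form: Student's t statistic of the correlation coefficient."""
    r = correlation(x, y)
    return r * math.sqrt(len(x) - 2) / math.sqrt(1 - r * r)

def normalized_distance(x, y):
    """Sum of squared differences divided by the number of summands."""
    return sum((p - q) ** 2 for p, q in zip(x, y)) / len(x)

short_x, short_y = [1, 2, 3, 5], [1, 3, 2, 5]
long_x, long_y = short_x * 2, short_y * 2   # same pattern, twice the overlap

print(normalized_distance(short_x, short_y), normalized_distance(long_x, long_y))
print(t_value(short_x, short_y), t_value(long_x, long_y))
```

The normalized distance is identical for both overlaps, whereas the t-value is larger for the longer one, so a match over many years is ranked above an equally good match over a few years.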



3.2 Heuristic Postprocessing of the Results 

The algorithm described so far simply sorts all results in the end without
taking the number of edit operations into account. Therefore the best results
will often contain too many edit operations. A standard approach to that problem
is to penalize edit operations either by a multiplicative or an additive term.
Unfortunately this also affects edit operations at correct positions. Indeed, it
seems that with respect to penalties incorrect edit operations are somehow more
robust. We therefore decided not to penalize the edit operations, but to compare
the obtained results in a heuristic postprocessing step. We store all edit
distances of all last rows and last columns of each level of each α-box in an
overall result structure. Usually a good dating then appears several times among
the best results. Those similar results differ only in some edit operations:
they usually share some edit operations (the correct ones) and include some more
edit operations which improve the edit distance a little but which are
incorrect.

The heuristic we developed tries in two phases to identify possibly redundant
results and deletes those from the overall result structure. For each edit
distance result (that is, an edit distance in the last row or column of a
box-level) we do a redundancy check within the box plus another check concerning
some neighboring boxes.

In the first check phase it is tested whether the normalized distance is
significantly smaller than each normalized edit distance associated with each
last cell on the diagonals on the transformation path. This captures the idea
that one good match often appears several times, where the different occurrences
share some correct edit operations and include some more incorrect edit
operations. An additional edit operation should therefore be admitted only if it
improves (hence decreases) the edit distance significantly. A normalized edit
distance e is said to be significantly smaller than the normalized edit distance
e_comp if e/e_comp < 0.9.




234 



Carola Wenk 



If an edit distance did not pass this check it is deleted from the overall result 
structure and not compared to other edit distances anymore. 
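The first check can be sketched as a simple predicate. The names are hypothetical, and the list `path_distances` stands for the normalized edit distances found at the last cells of the diagonals visited by the transformation path; how these are collected from the box is not shown here.

```python
def significantly_smaller(e, e_comp, ratio=0.9):
    """True if e / e_comp < ratio, written without a division by zero."""
    return e < ratio * e_comp

def passes_first_check(e, path_distances, ratio=0.9):
    """Keep a result only if it clearly beats every result with fewer edits
    on its own transformation path."""
    return all(significantly_smaller(e, e_comp, ratio) for e_comp in path_distances)

print(passes_first_check(0.50, [1.0, 0.8]))  # True: clear improvement over both
print(passes_first_check(0.75, [1.0, 0.8]))  # False: 0.75 / 0.8 is not below 0.9
```

A result without edit operations has no predecessors on its path and therefore always passes.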

The second check phase is established to eliminate those inter-box redundant
results which date the sequence incorrectly by a few years due to superfluous
edit operations at the beginning of the transformation sequence. Call the
subsequence of the transformation sequence which includes only the insert and
merge operations an edit sequence. For every prefix of the edit sequence the
time transposition it induces is calculated (a merge operation corresponds to a
transposition one year to the left, an insert operation one year to the right).
The α-box at the transposed position is checked for an edit distance e_comp
with an edit sequence equal to the remaining suffix of the edit sequence
being checked. Now the normalized edit distance e whose edit sequence probably
contains an unnecessary prefix is deleted if e/e_comp > 0.9.
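The bookkeeping of the second check can be sketched as follows. The function names are hypothetical; looking up the neighbouring box and its edit distances is left out, and only the prefix transpositions and the deletion rule are shown.

```python
def prefix_transpositions(edit_sequence):
    """For each non-empty prefix return (prefix, shift in years, remaining suffix).

    A merge shifts the dating one year to the left (-1), an insert one year
    to the right (+1).
    """
    shift = 0
    result = []
    for pos, op in enumerate(edit_sequence, start=1):
        shift += {"M": -1, "I": 1}[op]
        result.append((edit_sequence[:pos], shift, edit_sequence[pos:]))
    return result

def is_redundant(e, e_comp, ratio=0.9):
    """Deletion rule of the second check: e / e_comp > ratio."""
    return e > ratio * e_comp

for prefix, shift, suffix in prefix_transpositions("MIM"):
    print(prefix, shift, repr(suffix))
# M -1 'IM'
# MI 0 'M'
# MIM -1 ''
```

For each entry, the box located `shift` years away would be searched for a result whose edit sequence equals `suffix`; if one exists and `is_redundant` holds, the longer result is dropped.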

3.3 Crossdating Algorithm 

Figure 7 shows the crossdating algorithm. The standardization can be done in
Θ(m + n) time and space. The number of α-boxes to be filled is O(n + m), and
since we need Θ(α² min(n, m)) time to fill one α-box (see Par. 2.3), we can fill
them all in O(α² mn) time. We do not need to store all α-boxes, since we need
only the last row and the last column of each level of every α-box. There is one
last row or last column entry for each diagonal, so that we only have to count
the diagonals, of which there are at most (k + 1) in every k-level. So we have
O((m + n) Σ_{k=0}^{α} (k + 1)) = O(α² (m + n)) edit distances to be stored. But
for each edit distance we also store its corresponding optimal transformation,
which needs Θ(k) space, so that the space needed altogether to store all results
sums up to O((m + n)(1 + Σ_{k=1}^{α} (k + 1) k)) = O(α³ (m + n)). Additionally,
O(α² min(n, m)) space is needed for the one α-box that is being filled.

    Crossdating algorithm:

    Standardization of the master and the sample sequence
    For all overlap positions of the sample in the master
        Fill α-box
        For all cells in the last row and last column of each level
            Normalize edit distance
            Compute optimal transformation
            Redundancy check 1: Check with normalized edit distances on
                transformation path.
            If edit distance is not redundant:
                Store the distance, the t-value, the transformation and the
                offset number in an overall result structure.
    Redundancy check 2: Remove inter-box-redundant results.
    Sort all results in the result structure by decreasing t-value.
    Display the best results (those with the highest t-values).

Fig. 7. Crossdating algorithm

The first redundancy check needs Θ(k) time for each edit distance, which sums
up to O((m + n)(1 + Σ_{k=1}^{α} (k + 1) k)) = O(α³ (m + n)) altogether. The
second redundancy check needs O(k²) time each, hence together
O((m + n)(1 + Σ_{k=1}^{α} (k + 1) k²)) = O(α⁴ (m + n)). For the redundancy
checks asymptotically no additional space is needed. The sorting of all results
takes O(α² (m + n) log(α² (m + n))) time. (Since we are interested in some of
the best results only, we actually do not have to sort all results, but sorting
does not affect the asymptotic running time.) So altogether the algorithm needs
O(α² mn + α⁴ (m + n)) time and O(α³ (m + n)) space.
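The bounds in this paragraph rest on three sums over the levels k. As a worked check (the lower summation limit k = 1 in the last two sums is an assumption, since the limits are not legible in the source), their closed forms are:

```latex
\begin{align*}
\sum_{k=0}^{\alpha} (k+1)     &= \frac{(\alpha+1)(\alpha+2)}{2}                 = O(\alpha^2), \\
\sum_{k=1}^{\alpha} (k+1)\,k   &= \frac{\alpha(\alpha+1)(\alpha+2)}{3}           = O(\alpha^3), \\
\sum_{k=1}^{\alpha} (k+1)\,k^2 &= \frac{\alpha(\alpha+1)(\alpha+2)(3\alpha+1)}{12} = O(\alpha^4).
\end{align*}
```

Multiplying each sum by the O(m + n) boxes gives the number of stored edit distances, the space for the stored transformations (and the time of the first redundancy check), and the time of the second redundancy check, respectively.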

4 Test Results 

4.1 Implementation 

The crossdating algorithm has been implemented in C++ in a command-line- 
oriented Unix environment. A program executable can be obtained from the 
author. 

In practice a crossdating program outputs several good matchings (e.g. the
best 5 or 10), and the dendrochronologist visually checks whether one of them
represents the correct dating. Likewise, the program we have implemented allows
the user to subsequently evaluate the results according to different criteria.
That is, once the results have been computed, the best results in a certain time
interval, those having a larger minimum overlap, or those for a lower value of α
can be queried. The computation time of such modified result queries, which scan
the list of sorted results, is linear in the number of results, thus
O(α² (m + n)). Note that our program in particular computes all those t-values
that are computed during the simple sliding algorithm. They can be accessed by
querying the results for α = 0. Our program therefore generalizes the simple
sliding algorithm.

Several tests have been performed, which we present in the following three
paragraphs: tests with randomly generated missing or double rings, tests on
data containing real missing rings, and finally runtime tests. However, the
automated tests do not cover the interactive program properties.



4.2 Randomly Generated Disturbances

The program was tested on collections of already dated samples. In each
collection one sample was randomly disturbed by deleting or splitting up some
values, and an attempt was then made to date this sample against the mean
sequence of the remaining sequences in the collection.
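The two kinds of disturbance can be sketched as follows. How a value was split in the actual tests is not stated, so splitting a ring into two parts that sum to the original width is an assumption, and all names are hypothetical.

```python
import random

def delete_ring(widths, position):
    """Simulate a missing ring by dropping one value."""
    return widths[:position] + widths[position + 1:]

def split_ring(widths, position, fraction=0.5):
    """Simulate a double ring by splitting one value into two parts
    (assumed: the parts add up to the original width)."""
    first = widths[position] * fraction
    rest = widths[position] - first
    return widths[:position] + [first, rest] + widths[position + 1:]

def random_disturbance(widths, rng):
    """Apply one random deletion or one random splitting."""
    position = rng.randrange(len(widths))
    if rng.random() < 0.5:
        return "delete", position, delete_ring(widths, position)
    return "split", position, split_ring(widths, position, rng.uniform(0.2, 0.8))

rng = random.Random(1999)  # fixed seed so that a test run can be repeated
kind, position, disturbed = random_disturbance([1.0, 4.0, 3.0, 2.0, 5.0], rng)
print(kind, position, disturbed)
```

A deletion has to be repaired by an insert operation and a splitting by a merge operation, which is what the date & edit columns of the tables evaluate.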

Tests were performed mainly for the parameter values α = 2, a minimum
overlap of 50 and a redundancy threshold of 10 (see end of Par. 2.3). For each
sample a random disturbance has been carried out 5 times. Some test results
are shown in Tables 1, 2, 3, 4 and 5. Column date shows the percentage of those
data sets in the collection for which the correct dating has been found. Date &
edit shows the percentage of those data sets for which the correct dating and
the correct type of edit operation in an interval of radius 10 around the
correct position have been found. The column k shows the average number of
proposed edit operations for those results for which the correct date and edit
operation have been found. For standardization the percentage of a five-year
running mean (also called floating average; in the following tables abbreviated
with float-ave) or the difference of logarithms (abbreviated with log) were
used.



Table 1. Test results without manipulation of the data.

            Float-ave preprocessing       Log preprocessing
            α = 0    α = 2                α = 0    α = 2
            date     date     k           date     date     k
kieftest     96 %     80 %    0.3          96 %     86 %    0.5
germ001      98 %     80 %    0.2          98 %     83 %    0.2
germ003     100 %     94 %    0.1         100 %     88 %    0.1
germ004      96 %     79 %    0.7          96 %     88 %    0.9
germ006     100 %     88 %    0.4         100 %     94 %    0.0
cana030      94 %     72 %    0.3          94 %     89 %    0.0
az052       100 %     80 %    0.1         100 %     80 %    0.4
az526       100 %    100 %    0.1         100 %     95 %    0.8
SET01        94 %     69 %    0.7          94 %     63 %    0.5
SET02        96 %     71 %    0.8          96 %     63 %    0.5
oh004       100 %     72 %    0.1         100 %     79 %    0.1
swed302      98 %     82 %    0.0          98 %     89 %    0.1

The data used was supplied by the following sources: The kieftest data is
tree ring width data from a German pine, which was supplied by Deutsches
Archäologisches Institut*. The SET01 and SET02 data are files which come with
the crossdating program TSAP [?] (which does not search for missing or double
rings during the crossdating). The other data was taken from the ITRDB†, where
the germ data sets are from German oaks, the cana data from Canadian white
spruce, the az data from Arizona, where az526 is ponderosa pine, the oh data
from white oak from Ohio and the swed data from Scotch pine from Sweden.

Table 1 shows how many samples of each tested collection the algorithm dates
correctly when no random disturbances have been performed. With α = 0 this
equals a simple sliding algorithm concerning t-values, which dates 97% of the
samples correctly on average‡. With α = 2 the percentage of correct datings
decreases by up to 33% (on average only by 15%, though), and the number of
mistakenly found edit operations increases by up to 0.9 (on average only by
0.4).

* We would like to thank Dr. K.-U. Heußner from Deutsches Archäologisches
Institut, Eurasien-Abteilung, Im Dol 2-6, D-14195 Berlin.

† The International Tree-Ring Data Bank (ITRDB) is located at
http://www.ngdc.noaa.gov/paleo/treering.html.

‡ On average here means averaged over the tested collections shown in the
tables.

To see how our algorithm performs on sequences that contain missing rings,
take a look at Table 2, which shows test results for the germ001 data with one
randomly deleted element and for different values of α. Setting α = 0 (again,
this equals the simple sliding algorithm), the correct date is found in only 51%
or 41% of the cases (for floating-average or log preprocessing, respectively).
When allowing the algorithm to perform some edit operations by choosing α
slightly greater than 1, the chance for a correct dating increases dramatically.
But the farther α is away from the correct number of edit operations needed, the
more false edit operations are performed and the more the chance for a correct
date decreases. But since not many double or missing rings are expected in a
tree ring sequence, a small value of α (e.g. 2 or 3) should be sufficient most
of the time.



Table 2. Test results concerning the germ001 data with a random deletion of
one sample element and different values for α.

        Float-ave preprocessing          Log preprocessing
  α     date    date & edit    k         date    date & edit    k
  0     51 %       0 %        0.0        41 %       0 %        0.0
  1     94 %      91 %        1.0        95 %      91 %        1.0
  2     93 %      89 %        1.0        92 %      87 %        1.0
  3     81 %      77 %        1.4        84 %      76 %        1.4
  4     82 %      78 %        1.5        84 %      76 %        1.4
  5     85 %      78 %        2.2        85 %      76 %        1.9



Tables 3, 4 and 5 show test results for different data collections with one
random deletion, one random splitting and two random consecutive deletions,
respectively. Although the percentages for a correct dating with a correct edit
operation vary from 38% to 95%, they are usually much higher than those for
a dating with α = 0. However, as can be seen in the column date for α > 0,
the program finds the correct date more often than the correct date plus the
correct edit operation, because it proposes some wrong, additional or too few
edit operations. In fact, if the program finds the correct date, it usually
proposes most of the edit operations at an almost correct position and skips
necessary edit operations only if they are too close to the beginning or the end
of the sequence. In any case the results of the program give the user more
information about possible missing or double rings than the standard crossdating
methods do. Concerning two consecutive deletions, Table 5 shows that if two
missing rings have been found, they lie only about 2 or 3 years apart.

Table 3. Test results with a random deletion of one sample element.

            Float-ave preprocessing              Log preprocessing
            α = 0   α = 2                        α = 0   α = 2
            date    date    date & edit   k      date    date    date & edit   k
kieftest    24 %    73 %    67 %          [?]    28 %    76 %    69 %          [?]
germ001     51 %    92 %    89 %          [?]    41 %    [?]     87 %          [?]
germ003     58 %    94 %    83 %          [?]    53 %    92 %    81 %          [?]
germ004     47 %    84 %    76 %          [?]    50 %    [?]     78 %          [?]
germ006     21 %    91 %    85 %          [?]    26 %    [?]     81 %          [?]
cana030     27 %    72 %    69 %          [?]    30 %    [?]     80 %          [?]
az052       44 %    97 %    94 %          [?]    42 %    [?]     95 %          [?]
az526       38 %    91 %    79 %          [?]    35 %    [?]     83 %          [?]
SET01       34 %    [?]     54 %          [?]    25 %    59 %    49 %          [?]
SET02       37 %    54 %    48 %          [?]    33 %    [?]     53 %          [?]
oh004       47 %    85 %    83 %          [?]    42 %    [?]     80 %          [?]
swed302     43 %    83 %    72 %          [?]    39 %    [?]     79 %          [?]

Table 4. Test results with a random splitting of one sample element.

            Float-ave preprocessing              Log preprocessing
            α = 0   α = 2                        α = 0   α = 2
            date    date    date & edit   k      date    date    date & edit   k
kieftest    22 %    75 %    63 %          [?]    28 %    75 %    58 %          [?]
germ001     52 %    92 %    88 %          [?]    42 %    [?]     88 %          [?]
germ003     56 %    94 %    83 %          [?]    52 %    92 %    79 %          [?]
germ004     48 %    [?]     78 %          [?]    50 %    [?]     74 %          [?]
germ006     19 %    [?]     81 %          [?]    26 %    [?]     85 %          [?]
cana030     29 %    72 %    66 %          [?]    34 %    [?]     71 %          [?]
az052       47 %    [?]     [?]           [?]    42 %    [?]     80 %          [?]
az526       36 %    83 %    74 %          [?]    34 %    [?]     66 %          [?]
SET01       18 %    [?]     53 %          [?]    16 %    [?]     56 %          [?]
SET02       28 %    57 %    48 %          [?]    32 %    59 %    52 %          [?]
oh004       47 %    87 %    85 %          [?]    43 %    [?]     83 %          [?]
swed302     34 %    74 %    67 %          [?]    35 %    79 %    69 %          [?]

Table 5. Test results with a random deletion of two consecutive sample
elements. The column dist. shows the distance between the edits.

            Float-ave preprocessing                    Log preprocessing
            α = 0   α = 3                              α = 0   α = 3
            date    date   date & edit   k     dist.   date    date   date & edit   k     dist.
kieftest    13 %    67 %   55 %          2.2   2.7     21 %    73 %   56 %          2.1   2.5
germ001     56 %    91 %   85 %          2.0   2.2     54 %    89 %   78 %          2.0   2.2
germ003     54 %    91 %   74 %          2.0   3.5     58 %    89 %   68 %          2.0   3.5
germ004     38 %    80 %   79 %          2.1   3.1     49 %    85 %   69 %          2.1   2.7
germ006     16 %    85 %   75 %          2.0   2.5     21 %    84 %   66 %          2.0   2.5
cana030     20 %    71 %   66 %          2.0   2.8     34 %    86 %   73 %          2.0   3.0
az052       44 %    94 %   91 %          2.0   1.8     42 %    91 %   88 %          2.0   1.7
az526       44 %    89 %   79 %          2.0   2.2     48 %    92 %   78 %          2.1   2.3
SET01       24 %    55 %   49 %          2.0   2.8     34 %    55 %   39 %          2.2   3.1
SET02       30 %    51 %   39 %          2.1   3.1     38 %    52 %   38 %          2.1   2.4
oh004       45 %    79 %   76 %          2.0   2.3     49 %    76 %   72 %          2.0   2.2
swed302     34 %    77 %   65 %          2.0   2.7     48 %    81 %   65 %          2.0   2.6

4.3 Real Missing Rings

In this paragraph we present results from tests performed on data containing
real missing rings. Test data like this is widely available because many
dendrochronologists mark a missing ring as a ring with width 0. Test data for
double rings is rather hard to find because dendrochronologists usually do not
record the occurrence of double rings. The reason is that there is a chance to
visually identify a double ring on the wood (e.g. after some more preparation
of the wood or using a better microscope), whereas for a missing ring there is
not. We therefore restricted the tests on data with real inconsistencies to data
containing missing rings.



Table 6. Test results for data with real missing rings; α = 4.

            # of      ave. # of                 date &                  date &
            samples   miss. rings   date        edit        date        edit
wa067       19        2.42          16 (84%)    12 (63%)    17 (89%)    14 (74%)
wa069       13        1.08          12 (92%)    9 (69%)     13 (100%)   10 (77%)
wa072       19        2.32          16 (84%)    13 (68%)    16 (84%)    13 (68%)
wa079       23        2.22          17 (74%)    15 (65%)    18 (78%)    17 (74%)
breclav     7         1.0           7 (100%)    5 (71%)     7 (100%)    6 (86%)
chin04      13        2.92          12 (92%)    9 (69%)     13 (100%)   11 (85%)
az052       6         3.33          5 (83%)     4 (67%)     5 (83%)     4 (67%)
az526       14        3.71          13 (93%)    6 (43%)     13 (93%)    7 (50%)

Table 6 shows test results for collections of samples where some samples
contain missing rings. During the tests α was set to 4. The breclav data was
supplied by the Deutsches Archäologisches Institut; the others are available
at the ITRDB. The wa data is from a subalpine larch from Washington State,
the chin data is Armand's pine from China, and the az data is from Arizona,
as mentioned above. The column # of samples contains the number of samples
in the collection that contain missing rings. The ave. # of miss. rings column
shows the average number of missing rings contained in these samples. The date
& edit column contains the number of samples (and also the percentage relative
to the # of samples) that the algorithm dates correctly with the correct number
and position (with a tolerance of 10) of insertions. The master sequences to
date against are built from those samples in each collection that do not
contain missing rings.



4.4 Runtime Tests 

The crossdating algorithm has been tested on a Sparc Ultra 1 machine. The
runtime the program needs for fixed α = 2 is illustrated in Fig. 8, and for a
fixed sample length n = 300 in Fig. 9. For a typical input consisting of a
sample of length n = 300, a master of length m = 1000 and α = 2, the program
needs about 13 seconds.



Fig. 8. Runtime tests with α = 2, where n and m are the lengths of the sample
and master sequence, respectively.



5 Implementation and Conclusions 

We investigated the problem of matching tree ring width sequences (crossdating),
which arises in dendrochronology. Assuming that a tree forms exactly one
ring each year, the matching can be performed by a simple Θ(mn) algorithm.
We presented an O(α^2 mn + α^4 (m + n)) crossdating algorithm which takes the
possibility of missing and double rings into account by employing an edit
distance as a distance measure.

Fig. 9. Runtime tests with n = 300 fixed, where n and m are the lengths of the
sample and master sequence, respectively.

The algorithm has been implemented and tested. The tests show that the
dating quality of the algorithm depends strongly on the input data. It
is best when α equals the number of inconsistencies to be found. It is therefore
not possible to date a tree ring sequence solely according to the first dating
proposition of the algorithm. However, in the usual dating process the results
of an automatic matching are taken as dating propositions only and are always
visually verified by a dendrochronologist. Since our program allows the
evaluation of the computed results for several different parameter settings,
e.g. for a lower value of α, in rather fast O(α^4 (m + n)) time, and it usually
offers some good datings, the program should be suitable as an additional
dating tool searching for missing and double rings.

For further research it would be interesting to investigate whether it is
possible to compare several tree ring sequences at once in order to produce a
mean sequence (master sequence, chronology). The question could also be raised
as to whether similar matching techniques based on the edit distance can be
applied to other environmental archives like sea or glacier sediments, where
certain environmental events produce different distortions of the underlying
sequences.
Abstract. We address the problem of approximate string matching in
d dimensions, that is, to find a pattern of size m^d in a text of size n^d
with at most k < m^d errors (substitutions, insertions and deletions
along any dimension). We use a novel and very flexible error model,
for which there exists only an algorithm to evaluate the similarity
between two elements in two dimensions in O(m^4) time. We extend the
algorithm to d dimensions, in O(d! m^{2d}) time and O(d! m^{2d-1}) space. We
also give the first search algorithm for such a model, which takes
O(d! m^d n^d) time and O(d! m^d n^{d-1}) space. We show how to reduce the
space cost to O(d! 3^d m^{2d-1}) with little time penalty. Finally, we present
the first sublinear-time (on average) searching algorithm (i.e. not all text
cells are inspected), which is O(k n^d / m^{d-1}) for
k < (m/(d(log_σ m - log_σ d)))^{d-1}, where σ is the alphabet size. After that
error level the filter still remains better than dynamic programming for
k < m^{d-1} / (d(log_σ m - log_σ d))^{(d-1)/d}. These are the first search
algorithms for the problem. As side-effects we extend to d dimensions an
already proposed algorithm for two-dimensional exact string matching, and we
obtain a sublinear-time filter to search in d dimensions allowing k mismatches.



1 Introduction 

Approximate pattern matching is the problem of finding a pattern in a text 
allowing errors (insertions, deletions, substitutions) of characters. A number of 
important problems related to string processing lead to algorithms for approx- 
imate string matching: text searching, pattern recognition, computational biol- 
ogy, audio processing, etc. Two dimensional pattern matching with errors has 
applications, for instance, in computer vision (i.e. searching a subimage inside 
a large image). In three dimensions, our algorithms may be useful for searching 
allowing errors in video data (where the time would be the third dimension) or 
in some types of medical data (e.g. MRI brain scans). 

For one dimension this problem is well known, and is modeled using the
edit distance. The edit distance between two strings a and b, ed(a,b), is
defined as the minimum number of edit operations that must be carried out to
make them equal. The allowed operations are insertion, deletion and
substitution of characters in a or b. The problem of approximate string
matching is defined as follows: given a text of length n and a pattern of
length m, both being sequences over an alphabet Σ of size σ, find all segments
(or “occurrences”) in the text whose edit distance to the pattern is at most k,
where 0 < k < m. The classical solution takes O(mn) time and involves dynamic
programming [20].

* Supported in part by Fondecyt grant 1-990627.

Krithivasan and Sitalakshmi (KS) [17] proposed a simple extension to two
dimensions. Given two images of the same size, the edit distance is the sum of
the edit distances of the corresponding row images. This definition is
justified when the images are transmitted row by row and there are not too many
communication errors (e.g. photocopy images, where most errors come from the
mechanical traction mechanism along one dimension only, or images transmitted
by fax), but it is not appropriate otherwise. Using this model they define an
approximate search problem where a subimage of size m x m is searched for in a
large image of size n x n, which they solve in O(m^2 n^2) time using a
generalization of the classical one-dimensional algorithm.

In [n|, Baeza-Yates (BY) defined a more general extension (there called RC), 
where the errors can occur along rows or columns at any time. This model is 
much more robust and useful for more applications. We are interested in this 
general model in this work. Figure 1 gives an example.





Fig. 1. Alternative error models. 



Although in |S| they give an O(m^4) time algorithm to compute the edit
distance between two images of size m x m, they do not give any algorithm to
search a subimage inside a larger image allowing errors.

In this work, we first generalize the edit distance algorithm to d dimensions
with complexity O(d! m^{2d}). We then give an O(d! m^d n^d) time algorithm for
the search problem, matching the complexity of the simpler KS model
in two dimensions, and show how to reduce the space requirements so that
they depend only on the pattern size. We also give a new filtering algorithm
that quickly discards large parts of the text that cannot contain a
match. This algorithm searches the pattern in average time O(k n^d / m^{d-1})
for k < (m/(d(log_σ m - log_σ d)))^{d-1}, where σ is the alphabet size. After
that error level the filter changes its cost but remains better than dynamic
programming for k < m^{d-1} / (d(log_σ m - log_σ d))^{(d-1)/d}. These are the
first searching algorithms for this problem.

Two side-effects are obtained as well. First, we generalize to d dimensions 
and analyze a previously proposed algorithm to search in two dimensions not 
allowing errors. Second, we obtain a filter to search a pattern in d dimensions 
allowing up to k character substitutions. 

2 Previous Work 

The classical algorithm [20] to search a pattern in a text allowing errors
uses dynamic programming and takes O(mn) time and O(m) space.

This solution was later improved by a number of algorithms, which we do not
cover here. The only one of interest to this work is a filtering algorithm EIIIH1|7|.
It states that if a pattern is cut into k + 1 pieces, then any occurrence with
up to k errors must contain one of the pieces unchanged. This holds because k
errors cannot alter all the k + 1 pieces given the edit operations that we
consider (which cannot alter two pieces at the same time). The algorithm simply
scans the text using a multipattern exact search algorithm for all the pieces.
Each time a piece is found, it uses dynamic programming over an area of length
m + 2k where the approximate occurrence can be found.
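
The following is a minimal Python sketch of this partitioning filter, given only as an illustration. It makes two simplifying assumptions that are not part of the original description: the multipattern search is replaced by a separate `str.find` scan per piece (an Aho-Corasick or Commentz-Walter machine would be used in practice), and the function names `dp_ends` and `filter_search` are hypothetical. Verification runs the classical dynamic programming search on a window of length m + 2k around each piece occurrence.

```python
def dp_ends(text, pattern, k):
    """1-based end positions i where some suffix of text[:i] matches
    pattern with at most k errors (classical column-wise DP)."""
    m = len(pattern)
    col = list(range(m + 1))            # column for the empty text prefix
    ends = []
    for i, t in enumerate(text, start=1):
        new = [0] * (m + 1)             # empty pattern matches anywhere
        for j in range(1, m + 1):
            if t == pattern[j - 1]:
                new[j] = col[j - 1]
            else:
                new[j] = 1 + min(col[j - 1], col[j], new[j - 1])
        if new[m] <= k:
            ends.append(i)
        col = new
    return ends


def filter_search(text, pattern, k):
    """Partition the pattern into k + 1 pieces, search them exactly,
    and verify a window of length m + 2k around every piece hit."""
    m = len(pattern)
    assert k + 1 <= m, "each of the k + 1 pieces must be non-empty"
    cuts = [(p * m) // (k + 1) for p in range(k + 2)]
    hits = set()
    for p in range(k + 1):
        off = cuts[p]                   # offset of the piece in the pattern
        piece = pattern[cuts[p]:cuts[p + 1]]
        pos = text.find(piece)
        while pos != -1:
            lo = max(0, pos - off - k)
            hi = min(len(text), pos - off + m + k)
            for e in dp_ends(text[lo:hi], pattern, k):
                hits.add(lo + e)
            pos = text.find(piece, pos + 1)
    return sorted(hits)
```

Under these assumptions the filter reports the same end positions as running `dp_ends` over the whole text, but it only invokes the dynamic programming step near exact occurrences of a piece.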

The multipattern search can be carried out in O(n) worst-case search time by
using an Aho-Corasick machine or in O(n/m) best-case time using
Commentz-Walter ^2] or another Boyer-Moore type algorithm adapted to
multipattern search. The total cost of verifications stays below O(n) if
k/m < 1/(3 log_σ m).

Two dimensional string matching was first considered by Bird and Baker
p,nni, who obtain O(n^2) worst-case time. Good average results are presented
by Zhu and Takaoka. The best average case result is due to Baeza-Yates
and Regnier [2], who obtain O(n^2/m) time on average and O(n^2) in the worst
case.

The case of two dimensional approximate string matching usually considers
only substitutions for rectangular patterns, which is much simpler than the
general case with insertions and deletions. For substitutions, the pattern shape
matches the same shape in the text (e.g. if the pattern is a rectangle, it matches
a rectangle of the same size in the text). For insertions and deletions, instead,
rows and/or columns of the pattern can match pieces of the text of different
length. Under the substitutions model, one of the best results on the worst case
is due to Amir and Landau P|, which achieves O((k + log σ) n^2) time but uses
O(n^2) space. A similar algorithm is presented in ^3-. Ranka and Heywood solve
the same problem in O((k + m) n^2) time and O(kn) space. Amir and Landau also
present a different algorithm running in O(n^2 log n log log n log m) time. On
average, the best algorithm is due to Kärkkäinen and Ukkonen, with its analysis
and space usage improved by Park ng. The expected time is
O(n^2 k/m^2 log_σ m) for k < m^2/(4 log_σ m) using O(m^2) space (O(k) space on
average). This time result is optimal for the expected case.

Krithivasan and Sitalakshmi (KS) [17] defined the edit distance in two
dimensions as the sum of the edit distances of the corresponding row images.
Using this model they search a subimage of size m x m in a large image of size
n x n, in O(m^2 n^2) time using a generalization of the classical
one-dimensional algorithm. Krithivasan m presents for the same model an
O(m(k + log m) n^2) algorithm that uses O(mn) space. Amir and Landau 0 give an
O(k^2 n^2) worst case time algorithm using O(n^2) space. Amir and Farach |3
also considered non-rectangular patterns, achieving
O(k(k + sqrt(m log m) sqrt(k log k)) n^2) time.

In |S1 we use the same model and improve the expected case to
O(n^2 k log_σ m / m^2) on average for k < m(m + 1)/(5 log_σ m), using O(m^2)
space. This time matches the optimal result allowing only substitutions, and
is also optimal, the restriction on k being only a bit stricter. For higher
error levels, 0 presents an algorithm with time complexity
O(n^2 k/(sqrt(σ) log n)), which works for k < m(m + 1)(1 - e/sqrt(σ)). It is
also shown that this limit on k cannot be improved.

In 13, Baeza-Yates defined more general models, where the errors can occur
along rows or columns. Three distances R, C and L are defined, and for the first
two it is shown that the filters of |3 can be applied to obtain the same complexity
and slightly reduced tolerance to errors, i.e. k < m(m + 1)/(7 log_σ m). A fourth
model defined in |3 is called RC, which generalizes R and C since the errors
can occur along rows or columns at any time. This model is much more robust
and useful for more applications, and is the one we use in this work. We cover
this model in detail in the next section.



3 Multidimensional Approximate Searching 

The classical dynamic programming algorithm [20] to compute the edit distance
between two one-dimensional strings A and B of lengths m_1 and m_2 computes
a matrix C_{0..m_1, 0..m_2}. The value C_{i,j} holds the edit distance between
A_{1..i} and B_{1..j}. The construction algorithm is as follows

    C_{i,0} ← i ,   C_{0,j} ← j
    C_{i,j} ← if A_i = B_j then C_{i-1,j-1}
              else 1 + min(C_{i-1,j-1}, C_{i-1,j}, C_{i,j-1})

and the distance ed(A,B) is the final value of C_{m_1,m_2}. The rationale of the
formula is that if A_i = B_j then the cost to convert A_{1..i} into B_{1..j} is
that of converting A_{1..i-1} into B_{1..j-1}. Otherwise we have to make one
error and select among three choices: (a) convert A_{1..i-1} into B_{1..j-1}
and replace A_i by B_j, (b) convert A_{1..i-1} into B_{1..j} and delete A_i,
and (c) convert A_{1..i} into B_{1..j-1} and insert B_j.
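
As an illustration, the recurrence for C can be transcribed directly into Python. This is a minimal sketch; the function name `edit_distance` is ours, and Python strings are 0-based, so A_i corresponds to `a[i - 1]`.

```python
def edit_distance(a, b):
    """Fill the (m1 + 1) x (m2 + 1) matrix C and return C[m1][m2] = ed(a, b)."""
    m1, m2 = len(a), len(b)
    C = [[0] * (m2 + 1) for _ in range(m1 + 1)]
    for i in range(m1 + 1):
        C[i][0] = i                      # delete the first i characters of a
    for j in range(m2 + 1):
        C[0][j] = j                      # insert the first j characters of b
    for i in range(1, m1 + 1):
        for j in range(1, m2 + 1):
            if a[i - 1] == b[j - 1]:
                C[i][j] = C[i - 1][j - 1]
            else:
                C[i][j] = 1 + min(C[i - 1][j - 1],   # replace A_i by B_j
                                  C[i - 1][j],       # delete A_i
                                  C[i][j - 1])       # insert B_j
    return C[m1][m2]
```

For example, converting "survey" into "surgery" needs one substitution and one insertion.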

This algorithm takes O(m_1 m_2) space and time. It is easily adapted to search
a pattern P in a text T allowing up to k errors [20]. In this case we want to
report all the text positions i such that a suffix of T_{1..i} matches P with
at most k errors. This time the matrix is C_{0..n, 0..m} and the construction
formula is

    C_{i,0} ← 0 ,   C_{0,j} ← j
    C_{i,j} ← if T_i = P_j then C_{i-1,j-1}
              else 1 + min(C_{i-1,j-1}, C_{i-1,j}, C_{i,j-1})

where the only change is that a pattern of length zero matches with no errors
at any text position. All the positions i such that C_{i,m} ≤ k are reported.
This takes O(mn) time. The space can be reduced to O(m) by noticing that only
the old and new column of the matrix need to be stored. We define led(T, P) as
the smallest edit distance between the pattern P and a suffix of T, and
therefore led(T_{1..i}, P) = C_{i,m}.
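
The search variant, including the O(m) space reduction, might be sketched as follows. The name `approx_search` is hypothetical; the list `col` holds the values C_{i-1,0..m} and `new` holds C_{i,0..m}, so only two columns are alive at any time.

```python
def approx_search(text, pattern, k):
    """Return the 1-based text positions i with led(T_{1..i}, P) <= k,
    keeping only the previous and the current column of the matrix."""
    m = len(pattern)
    col = list(range(m + 1))             # C_{0,j} = j
    hits = []
    for i, t in enumerate(text, start=1):
        new = [0] * (m + 1)              # C_{i,0} = 0
        for j in range(1, m + 1):
            if t == pattern[j - 1]:
                new[j] = col[j - 1]
            else:
                new[j] = 1 + min(col[j - 1], col[j], new[j - 1])
        if new[m] <= k:                  # led(T_{1..i}, P) = C_{i,m}
            hits.append(i)
        col = new
    return hits
```

With k = 0 the function degenerates to exact matching and reports the end positions of the exact occurrences.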

In j2j, a natural extension of the edit distance notion to two dimensional
strings (or “images”) A and B was defined (called RC in that paper, and ed_2
in this work). It allows the errors to occur along any dimension. An algorithm
to compute the edit distance between two images is defined. For simplicity we
assume that they are square and of the same size m x m, although it is easy
to remove that limitation. The algorithm computes a four-dimensional matrix
C_{0..m,0..m,0..m,0..m}, so that C_{i,j,p,q} = ed_2(A_{1..i,1..j}, B_{1..p,1..q}).
C is built using the formulas

    C_{i,0,0,0} ← i ,   C_{0,j,0,0} ← j
    C_{0,0,p,0} ← p ,   C_{0,0,0,q} ← q

    C_{i,j,p,q} ← min( C_{i-1,j,p-1,q} + ed(A_{i,1..j}, B_{p,1..q}),  C_{i-1,j,p,q} + j,  C_{i,j,p-1,q} + q,
                       C_{i,j-1,p,q-1} + ed(A_{1..i,j}, B_{1..p,q}),  C_{i,j-1,p,q} + i,  C_{i,j,p,q-1} + p )

which has a rationale very similar to that of the one-dimensional case: at each
point we can solve the last row (first line of the min() formula) or the last
column (second line of the min() formula). In each case, we either insert the
whole row, delete the whole row, or replace the row of A by the row of B (and
ed() gives the best way to do it). This algorithm takes O(m^6) time and O(m^4)
space. However, by precomputing all the values

    Horiz_{i,j,p,q} = ed(A_{i,1..j}, B_{p,1..q}) ,   Vert_{i,j,p,q} = ed(A_{1..i,j}, B_{1..p,q})

(i.e. all the row-wise and column-wise alignments), the time drops to
O(m^4) and the space does not change. This is because the ed() terms of the C
formula are obtained in constant time, and Horiz consists of one-dimensional
edit distance computations between the rows A_{i,1..m} and B_{p,1..m}. The same
holds for Vert.
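
The precomputation step can be illustrated with a short Python sketch. It assumes that images are given as lists of equal-length strings, and the names `prefix_edit_matrix`, `horiz_table` and `vert_table` are ours. The point it shows is that a single one-dimensional dynamic programming matrix for a pair of rows (or columns) already contains the distances between all of their prefixes, so m^2 such matrices of m^2 cells each give all of Horiz (or Vert) in O(m^4) total time.

```python
def prefix_edit_matrix(a, b):
    """D[j][q] = ed(a[:j], b[:q]) for every pair of prefixes of a and b."""
    D = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for j in range(len(a) + 1):
        D[j][0] = j
    for q in range(len(b) + 1):
        D[0][q] = q
    for j in range(1, len(a) + 1):
        for q in range(1, len(b) + 1):
            if a[j - 1] == b[q - 1]:
                D[j][q] = D[j - 1][q - 1]
            else:
                D[j][q] = 1 + min(D[j - 1][q - 1], D[j - 1][q], D[j][q - 1])
    return D


def horiz_table(A, B):
    """H[i][p][j][q] = ed(row i of A cut to j columns, row p of B cut to q
    columns); rows are indexed from 0, prefix lengths j and q from 0 to m."""
    return [[prefix_edit_matrix(ra, rb) for rb in B] for ra in A]


def vert_table(A, B):
    """V[j][q][i][p] = ed(column j of A cut to i rows, column q of B cut to
    p rows); obtained by transposing both images."""
    cols_a = ["".join(c) for c in zip(*A)]
    cols_b = ["".join(c) for c in zip(*B)]
    return [[prefix_edit_matrix(ca, cb) for cb in cols_b] for ca in cols_a]
```

Once both tables exist, every ed() term needed by the four-dimensional recurrence becomes a constant-time lookup.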

The space can also be reduced to O(m^3), as shown in [S|. We select, say, i as
the outermost variable of the iteration to fill the matrix. Therefore, we need
only the values at iteration i - 1 to compute the values at iteration i. Hence,
we do not need to store the cells of all the iterations over i, just those of
the last one. The same can be done with Horiz and Vert, by using i as the
outermost iteration variable.

In mi they mention that this algorithm extends to d dimensions in O(m^{2d})
time, but they do not give the details. We give a detailed algorithm in the
next section and show that the exact complexity is O(d! m^{2d}). Also, no
algorithm was given in |S| to search a subimage in a larger image using the
above distance function. We do so in the following sections. We finally extend
the one-dimensional filtering algorithm to more dimensions.

4 Edit Distance in More Dimensions 

The idea of the previous section can be extended to compute ed_d(), i.e. the
edit distance generalized to d dimensions. The algorithm takes O(d! m^{2d})
time and O(d! m^{2d-1}) space.

A (2d)-dimensional matrix C is computed (d dimensions for A and d dimensions
for B), and the ed() of the above formula is replaced by ed_{d-1}. If the
values of ed_{d-1} are not precomputed then we need O(m^{2d-1}) space (by using
the trick of selecting one variable as the outermost in the iteration) plus the
space needed to compute ed_{d-1} (only one at a time is computed). This gives
the recurrence

    S_1 = m ,   S_d = m^{2d-1} + S_{d-1}

which yields O(m^{2d-1}) space. The time, on the other hand, involves filling
m^{2d} cells, where each cell performs a minimum over 3d elements (i.e.
insertion, deletion and ed_{d-1} in d dimensions). This makes it necessary to
compute d times the function ed_{d-1}(). That is

    T_1 = m^2 ,   T_d = m^{2d} (3d + d T_{d-1})

which yields O(d! m^{d(d+1)}). This matches the O(m^6) result for two
dimensions mentioned in 0.

However, we may precompute all the necessary values of ed_{d-1}(). Along each
one of the d dimensions, we take all the m^2 possible combinations (i,p) of
values of the selected dimension in A and B, and compute ed_{d-1}() between the
(d-1)-dimensional objects which result from restricting the selected dimension
to i in A and to p in B. Once this is done, the ed_{d-1} computations can be
taken as constants in the formula of ed_d(). The time cost is now

    T_1 = m^2 ,   T_d = m^{2d} 3d + d m^2 T_{d-1}

which yields O(d! m^{2d}) time (which matches the improved O(m^4) algorithm of
0 for two dimensions). This is a big improvement over the naive algorithm. The
space requirements are, however, higher. We have to store, for the
d-dimensional object, m^{2d} cells plus the precomputed values, along each
dimension, of all the m^2 combinations of (i,p) values for that dimension, and
all the space for the lower dimensions resulting for each pair (i,p). That is

    S_1 = m ,   S_d = m^{2d} + d m^2 S_{d-1}

which yields

    S_d = (1/m + 1/2! + ... + 1/d!) d! m^{2d} = O(d! m^{2d})

and we can use the trick of the outermost variable to reduce this to
O(d! m^{2d-1}).
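
A small numerical check of these recurrences can be written in Python. The sketch reads the recurrences with precomputation as T_1 = m^2, T_d = 3d m^{2d} + d m^2 T_{d-1} and S_1 = m, S_d = m^{2d} + d m^2 S_{d-1}; that reading, like the function names, is an assumption of this illustration. Exact rational arithmetic is used so that the closed form can be compared without rounding.

```python
from fractions import Fraction
from math import factorial


def time_precomputed(d, m):
    """Unroll T_1 = m^2, T_d = 3 d m^(2d) + d m^2 T_(d-1)."""
    t = m ** 2
    for dim in range(2, d + 1):
        t = 3 * dim * m ** (2 * dim) + dim * m ** 2 * t
    return t


def space_precomputed(d, m):
    """Unroll S_1 = m, S_d = m^(2d) + d m^2 S_(d-1)."""
    s = m
    for dim in range(2, d + 1):
        s = m ** (2 * dim) + dim * m ** 2 * s
    return s


def space_closed_form(d, m):
    """(1/m + 1/2! + ... + 1/d!) d! m^(2d), evaluated exactly."""
    series = Fraction(1, m) + sum(Fraction(1, factorial(i))
                                  for i in range(2, d + 1))
    return series * factorial(d) * m ** (2 * d)
```

For m = 10 the ratio of `time_precomputed(d, m)` to d! m^{2d} is 4 for d = 2 and 5.5 for d = 3, which is consistent with a bounded constant in front of d! m^{2d}.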
5 A Dynamic Programming Search Algorithm

We modify the edit distance algorithm so that instead of computing the edit
distance between two elements, it searches a small pattern P of size m^d inside
a large text T of size n^d. The idea is a simple modification of the edit
distance algorithm. For two dimensions the formula is as follows

    C_{i,0,0,0} ← 0 ,   C_{0,j,0,0} ← 0
    C_{0,0,p,0} ← p ,   C_{0,0,0,q} ← q

    C_{i,j,p,q} ← min( C_{i-1,j,p-1,q} + led(T_{i,1..j}, P_{p,1..q}),  C_{i-1,j,p,q} + q,  C_{i,j,p-1,q} + q,
                       C_{i,j-1,p,q-1} + led(T_{1..i,j}, P_{1..p,q}),  C_{i,j-1,p,q} + p,  C_{i,j,p,q-1} + p )

where the only differences are that the base values are zero when the pattern is
of size zero, that we penalize insertions and deletions according to the pattern
size, and that instead of ed() we use led(), so that we select the best suffix of
the text along each dimension. If we are searching allowing up to k errors, then
we report all text positions (i,j) such that C_{i,j,m,m} ≤ k.

The extension to more dimensions is immediate. By repeating the analysis of
the above section, we see that the naive algorithm takes
O(d! (mn)^{d(d+1)/2}) time and O(m^d n^{d-1}) space (since n is much larger
than m, we select one of the text coordinates as the outermost variable). By
precomputing the distances in lower dimensions, the search algorithm takes
O(d! m^d n^d) time and O(d! m^d n^{d-1}) space.

5.1 Correctness 

We now prove that the above algorithm is correct (in two dimensions). This
extends easily to more dimensions.

Lemma: For each text position (i,j), it is possible to perform C_{i,j,m,m} edit
operations in the pattern P (converting it into P') so that the pattern P'
matches the text suffix T_{..i,..j}, and this is not possible with fewer
operations.

Proof: We prove the Lemma for any C_{i,j,p,q}. The Lemma is obviously true
for the base case of the formula. For the recursive case, we inductively assume
that the Lemma is true for the subproblems. We consider the first line of the
update formula, which corresponds to the rows (the other cases are equivalent).

If the value for C_{i,j,p,q} is obtained using a row insertion in the pattern,
then we can inductively align P_{1..p,1..q} at T_{..i-1,..j} with cost
C_{i-1,j,p,q}, and then insert the text segment T_{i,..j} in P at the cost of
q more errors so as to align P_{1..p,1..q} at T_{..i,..j}.

If the value for C_{i,j,p,q} is obtained using a row deletion in the pattern,
then we can inductively align P_{1..p-1,1..q} at T_{..i,..j} with cost
C_{i,j,p-1,q}, and then delete the pattern row P_{p,1..q} from P at the cost of
q more errors so as to align P_{1..p,1..q} at T_{..i,..j}.

Finally, if we obtain C_{i,j,p,q} by replacing P_{p,1..q} with a row suffix of
T_{i,1..j}, then the led() of the formula gives the optimal way to do it, so
that we align P_{1..p-1,1..q} at T_{..i-1,..j} with cost C_{i-1,j,p-1,q} and
then convert the pattern row P_{p,1..q} to some text row suffix of T_{i,1..j}
at led(T_{i,1..j}, P_{p,1..q}) cost.

Alternatively, we can use the recursion on the column values. It is also clear
that this cannot be done better. On the other hand, we can use induction over the
number of dimensions to show that the Lemma is correct for any d-dimensional
problem.

5.2 Reducing the Space Requirements 

The space requirement of the algorithm is O(d! m^d n^{d-1}), which is too high.
This is awkward since the problem exhibits high locality. That is, whether a
text position matches or not depends only on the last (m + k)^d-size text
“suffix” that ends at that point. In fact, if k > m we just need to start 2m
positions behind the subtext at each dimension, since if more than m errors are
made along a given line, it is better to just perform m replacements.

Therefore, if we cut the text into (n/s)^d subtexts (of d dimensions) of size
s^d, we can work separately on each subtext provided we start, at each
dimension, m + min(m, k) positions behind the cube so as to have the context
properly initialized when we reach the cube. The total time is
(n/s)^d d! m^d (m + min(m, k) + s)^d, and the total space is
d! m^d (m + min(m, k) + s)^{d-1}.

For instance, we may select s = m, and then we obtain an algorithm which takes
at most O(d! 3^d m^d n^d) time and O(d! 3^d m^{2d-1}) space (and less if
k < m), which is much more reasonable. The minimum possible space requirement
is O(d! 2^d m^{2d-1}), at time cost O(d! 2^d m^{2d} n^d) (that is, s = 1).
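
The trade-off between the block size s and the resulting time and space can be made concrete with a small Python helper. It simply evaluates the two expressions (n/s)^d d! m^d (m + min(m, k) + s)^d and d! m^d (m + min(m, k) + s)^{d-1}; the name `blocked_cost` is hypothetical, and n/s is rounded up when s does not divide n, which is our choice rather than something stated in the text.

```python
from math import factorial


def blocked_cost(n, m, k, s, d):
    """Operation and cell counts when the text is cut into cubes of side s."""
    context = m + min(m, k)              # positions processed before each cube
    blocks = (-(-n // s)) ** d           # ceil(n / s) ** d cubes
    time = blocks * factorial(d) * m ** d * (context + s) ** d
    space = factorial(d) * m ** d * (context + s) ** (d - 1)
    return time, space
```

For n = 1000, m = 10, k = 20 and d = 2, choosing s = m gives 1.8 * 10^9 operations and 6000 cells, while s = 1 lowers the space to 4200 cells at about 8.8 * 10^10 operations.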

6 Multidimensional Exact String Matching 

In [2], the authors search, in two dimensions, a pattern in a text in O(n^2/m)
average time. They traverse only the text rows of the form i x m, searching for
all the pattern rows at the same time (using Aho-Corasick Q), and verify all
potential matches. Clearly, no match can be missed with the filter.
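
The row-sampling idea can be sketched in Python as follows. This is an illustration under stated assumptions rather than the cited algorithm itself: text and pattern are lists of equal-length strings, the pattern is m x m, rows are 0-based so the sampled text rows are m - 1, 2m - 1, and so on, and the simultaneous Aho-Corasick search is replaced by one `str.find` scan per pattern row. The function name `search_2d` is hypothetical.

```python
def search_2d(text, pattern):
    """Return the sorted (row, column) top-left corners of all exact
    occurrences of an m x m pattern, scanning only every m-th text row."""
    m, n_rows = len(pattern), len(text)
    found = set()
    for R in range(m - 1, n_rows, m):            # sampled text rows
        for r, prow in enumerate(pattern):       # every pattern row
            c = text[R].find(prow)
            while c != -1:
                top = R - r                      # implied top row of the match
                if top + m <= n_rows and all(
                        text[top + x][c:c + m] == pattern[x]
                        for x in range(m)):
                    found.add((top, c))
                c = text[R].find(prow, c + 1)
    return sorted(found)
```

Every occurrence covers m consecutive text rows and therefore exactly one sampled row, which is why no match is missed.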

In [2], the authors briefly mention that their technique can be extended to
more dimensions by selecting one dimension and recursively using an algorithm
for (d-1) dimensions on the m-th “rows” of such text. However no more details
are given, nor any analysis.

We now give a more detailed version of the algorithm and analyze it. We select
one dimension (say, coordinate i) and obtain n/m different (d-1)-dimensional
objects of the form T_{m,1..n,1..n,...}, T_{2m,1..n,1..n,...},
T_{3m,1..n,1..n,...}, and so on. On the other hand, we obtain m patterns of
(d-1) dimensions, namely P_{1,1..m,1..m,...}, P_{2,1..m,1..m,...},
P_{p,1..m,1..m,...} and so on. All the m subpatterns are searched in each one
of the (d-1)-dimensional subtexts. See Figure 2. Each time one of the
(d-1)-dimensional subpatterns is found in a text position, the complete
d-dimensional pattern is checked.

An important part of the analysis of |S| for two dimensions is that the total
cost to verify potential matches is not too large. It is not immediate that this
is still valid for more dimensions, since a very large number of verifications
are finally triggered.

Fig. 2. Algorithm for exact searching (panels: 2-d pattern, 2-d text). All the
pattern “rows” are searched in n/m text “rows” at the same time.

The cost to verify a potential match in d dimensions is always O(1) on
average, since we have to check whether m^d letters of the pattern are equal to
the text at a given position. Since we stop the checking as soon as we find a
mismatch, we verify more than c characters with probability 1/σ^c. Hence, the
average number of characters checked is Σ_{c≥0} 1/σ^c = O(1) (even for patterns
of unbounded size).

We denote by E_{d,r} the average search cost for r patterns in d dimensions.
The existence of the Aho-Corasick P algorithm implies that E_{1,r} = n. Now, for
d dimensions, we perform n/m searches for rm patterns on d-1 dimensions, and
check all the candidates that occur. The probability of a pattern of size
m^{d-1} occurring in a text position is 1/σ^{m^{d-1}}, but we multiply that by
rm because we search for rm different patterns. As the average cost to verify
each potential match is O(1), and the (d-1)-dimensional texts are of size
n^{d-1}, we have that

    E_{d,r} = (n/m) ( E_{d-1,rm} + n^{d-1} rm / σ^{m^{d-1}} )

which gives

    E_{d,r} = (n/m) E_{d-1,rm} + n^d r / σ^{m^{d-1}}
            = n^d / m^{d-1} + n^d r Σ_{i=1}^{d-1} 1/σ^{m^i}
            = O( n^d ( 1/m^{d-1} + r/σ^m ) )

(where the first term corresponds to the actual searches, which are all done in
one dimension).

To search for one pattern we replace r by 1 in this final formula (although the
algorithm internally uses multipattern search). This formula matches the result
for two dimensions, since 1/σ^m = o(1/m). In general, if d is considered fixed,
the above result for r = 1 can be bounded by O(n^d / m^{d-1}).
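
The unrolling of the recurrence can be checked numerically. The Python sketch below reads the recurrence as E_{1,r} = n and E_{d,r} = (n/m)(E_{d-1,rm} + n^{d-1} rm / σ^{m^{d-1}}), which follows the description in the text but is our transcription; the function names are hypothetical. Exact fractions are used so that the recursive and the unrolled forms can be compared for equality.

```python
from fractions import Fraction


def expected_cost(d, r, n, m, sigma):
    """E_{d,r} evaluated directly from the recurrence."""
    if d == 1:
        return Fraction(n)
    candidates = Fraction(n ** (d - 1) * r * m, sigma ** (m ** (d - 1)))
    return Fraction(n, m) * (expected_cost(d - 1, r * m, n, m, sigma)
                             + candidates)


def unrolled_cost(d, r, n, m, sigma):
    """n^d (1/m^(d-1) + r * sum_{i=1}^{d-1} 1/sigma^(m^i))."""
    tail = sum(Fraction(1, sigma ** (m ** i)) for i in range(1, d))
    return n ** d * (Fraction(1, m ** (d - 1)) + r * tail)
```

For instance, with n = 64, m = 4 and σ = 2 both functions agree for every d from 1 to 4, and the sum is dominated by its first term 1/σ^m.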

The space complexity of the algorithm corresponds to the Aho-Corasick machine,
whose space requirements are proportional to the total size of all the
patterns, i.e. O(r m^d). We now use this algorithm as a building block.
Fig. 3. Filtering algorithm for j = 3 (panels: 1 dimension, 2 dimensions,
3 dimensions). The maximum possible k so that some block appears unchanged is
2, 2, and 8 as the dimension grows.



7 A Fast Filter for Multidimensional Approximate Search 

We present now an effective filter to quickly discard large parts of the text which 
cannot contain a match, so that we use the dynamic programming algorithm to 
verify only the text areas which could contain an occurrence of the pattern. 

The filter is based on a generalization of the one-dimensional filter explained
in Section 2. In that case, we cut the pattern into (k + 1) pieces, and since
each error can destroy at most one piece, we always have one piece left
untouched inside each occurrence.

In two and more dimensions, we cut the pattern into j pieces along each
dimension, for some 1 < j ≤ m (see Figure 3). Since each error occurs along one
dimension only, at most kj pieces are destroyed. Therefore, since there are j^d
pieces in total, it is enough that j^d > kj to ensure that at least one of the
pieces is left untouched (although we do not know which one). Hence, we search
for all the pieces at the same time in the text without allowing errors. Those
pieces are of size (m/j)^d and can be searched with the algorithm of the
previous section in O(m^d) space and an average time of

    O( n^d ( j^{d-1}/m^{d-1} + j^d/σ^{m/j} ) ).
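
The counting argument j^d > kj is easy to explore numerically. The Python sketch below, with hypothetical function names, returns the smallest number of cuts per dimension that still guarantees an untouched piece, and the largest error level a given j can tolerate. It applies to d ≥ 2 only.

```python
def min_pieces_per_dim(k, d, m):
    """Smallest j with 1 < j <= m and j**d > k * j, or None if none exists."""
    for j in range(2, m + 1):
        if j ** d > k * j:
            return j
    return None


def max_errors(j, d):
    """Largest k for which j**d > k * j still holds, i.e. j**(d-1) - 1."""
    return j ** (d - 1) - 1
```

For j = 3 this gives at most 2 errors in two dimensions and 8 in three, and for k = 8 in three dimensions three cuts per dimension suffice, since 27 > 24.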

Each time one such piece is found, we have to verify a surrounding text
area to check for a possible match. This area extends (m + 2 min(m, k))
positions along each dimension (since the match could start at most min(m, k)
positions backward or finish up to min(m, k) positions forward). Hence, the cost
of a verification is the same as that of searching the pattern in a text of size
(m + 2 min(m, k))^d allowing errors, which is O(d! m^d (m + 2 min(m, k))^d). The
total number of verifications is obtained by multiplying the number of pattern
pieces by the probability of a piece matching, i.e. j^d n^d / σ^{(m/j)^d}.
Hence, the total expected cost for verifications is
O(d! j^d m^d (m + 2 min(m, k))^d n^d / σ^{(m/j)^d}).

Notice that, since we only verify pieces of the text of size
(m + 2 min(m, k))^d, the space requirement of this algorithm is
O(d! m^d (m + 2 min(m, k))^{d-1}) (this corresponds to the verification phase,
since the search of the pieces needs much less, i.e. O(m^d)). This is a form of
our previous technique to reduce space requirements (recall Section 5.2)
equivalent to using s = min(m, k). However, in this case we only check a few
portions of the text.

Both the search and the verification cost worsen as j grows, so we are
interested in the minimum j that works. As said, we need that j^d > kj, hence

    j = floor(k^{1/(d-1)}) + 1

is the best choice. The formula does not work for one dimension (because it is
not true that kj pieces are destroyed), and for 2 dimensions it sets j = k + 1
as in the traditional one-dimensional case. Notice that we need that j ≤ m, and
therefore the mechanism works for k < k_3 = m^{d-1}. Using this optimum (and
minimum) j, the total cost of searching plus verifying is

    n^d k^{d/(d-1)} ( 1/(m^{d-1} k^{1/(d-1)}) + 1/σ^{m/k^{1/(d-1)}}
                      + d! m^d (m + 2 min(m, k))^d / σ^{(m/k^{1/(d-1)})^d} )

which worsens as k grows. This search complexity has three terms, each of which
dominates for a different range of k values. The first one dominates for

    k < k_0 = m^{d-1} / (d log_σ m)^{d-1}

while the second dominates from k > k_0 until

    k < k_1 = m^{d-1} / (d (log_σ d + 2 log_σ m))^{(d-1)/d} (1 + o(1))

In the maximum acceptable value k = m‘^~^ — 1, the search complexity be- 
comes 0{dl3'^m^‘^n'^) , which is worse than using dynamic programming. We want 
to know which is the k value for which the filter is better than dynamic program- 
ming. We can compare against the version that uses the same amount of space 
(which corresponds to s = min(77i, k)), whose time complexity is 0{d\2‘^m?^n'^)\ 
or we can compare it against the fastest version of dynamic programming, which 
needs much more space and whose time cost is 0{d\mf"n'^). In either case we have 
that the k range for which the filter is better than dynamic programming is 



$$ k < k_2 = \frac{m^{d-1}}{(2d \log_\sigma m)^{(d-1)/d}} \, (1 + o(1)) $$

where the difference in the version of dynamic programming used affects lower
order terms only.

Finally, the most stringent condition we can ask of the filter is to be
sublinear, i.e. faster than $O(n^d)$. If we try to consider the third term of
the search complexity as dominant, we arrive at a k value which is smaller than
$k_1$, which means that the solution is in a stricter k range. By considering
the second term of the search complexity, we arrive at the condition
$k < k_0$. That is, the search time is sublinear precisely when the first term
of the summation dominates.

To summarize, the search algorithm is sublinear (i.e. 0{kn ‘^ for 
$k < (m/(d \log_\sigma m))^{d-1}$, and it improves over dynamic programming for
$k < m^{d-1}/(2d \log_\sigma m)^{(d-1)/d}$. Figure 4 illustrates the result of the analysis.
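
The choice of j used throughout this analysis can be sketched in a few lines. This is only an illustration, assuming the condition stated earlier ($j^d > kj$, i.e. $j^{d-1} > k$) and the requirement that j must not exceed m; the function names `min_pieces_per_side` and `filter_applicable` are hypothetical and do not come from the paper.

```python
def min_pieces_per_side(k: int, d: int) -> int:
    """Smallest integer j with j**(d-1) > k (the bound only makes sense for d >= 2)."""
    if d < 2:
        raise ValueError("the condition j^d > k*j is only used for d >= 2")
    if k < 0:
        raise ValueError("k must be non-negative")
    j = 1
    # Integer search avoids floating-point errors in k ** (1 / (d - 1)).
    while j ** (d - 1) <= k:
        j += 1
    return j


def filter_applicable(m: int, k: int, d: int) -> bool:
    """True when the pattern side m can be cut into j pieces, i.e. j <= m."""
    return min_pieces_per_side(k, d) <= m
```

In two dimensions the function returns k + 1, matching the remark in the text, and `filter_applicable` turns false exactly when k reaches $m^{d-1}$.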




[Figure: successive ranges of k in which the first, second, and third terms of the search cost dominate.]

Fig. 4. The complexity of the proposed filter, depending on k.



7.1 A Stricter Filter 

We have assumed up to now that we verify the presence of the pattern allowing
errors as soon as any of the pieces appears. However, we can do better. We
know that $j^d - jk$ pieces must appear, at their correct positions, for a
match to be possible. Therefore, whenever a piece appears, we can check the
neighborhood for the exact occurrences of other pieces. On average, the
verification of each piece will fail in O(1) character comparisons, and we will
check O(jk) pieces until jk of them fail the test (this is because both are
geometric processes). Therefore, we have a preverification test which occurs
with probability $j^d / \sigma^{(m/j)^d}$, costs O(jk) and is able to discard
more text positions before actually verifying the candidate area. The
probability that a text position passes the preverification test and undergoes
the dynamic programming verification can be computed by considering that
$j^d - jk$ cells need to match, which means that $m^d - k m^d / j^{d-1}$
characters match. On the other hand, we can select as we want which jk cells
match out of $j^d$, which multiplies the probability by $\binom{j^d}{jk}$.
Finally, if the text area passes this filter, we verify it at the same cost as
before (i.e. $d!\, m^d (m + 2\min(m,k))^d$). The new search cost is therefore



j jk 



+ 2 min(m, k)Y 



jTn^ —km^ I 



where the first term dominates for j < m/(<ilog^m), the second one up to 
j < m/(log^m + log^kY^’^, and the third one for larger j. The fourth term 
decreases with j, and therefore it is not immediate that the minimum j is the 
optimum (in fact it is not). We have not been able to determine the optimum 
j, but we can still obtain the maximum k value up to where the filter is better 
than dynamic programming. The first two terms are never worse than dynamic 
programming, and the third improves over dynamic programming for 

m , , ,, 

^ ~ {^og„m + \og„k- dlog^dy/'^ ^ 

which gives a condition on k since j^^~^ > k: 

k < k'^ = ^(1 + 0(1)) 

(d(log,^ TO - log,^ d)) d 

Now, we introduce this maximum j value into the fourth term to determine
whether it is also better than dynamic programming at that point. The result
is that, using that j value, the fourth term is dominated by the third precisely
for $k < k_2'$. Therefore we improve over dynamic programming for $k < k_2'$
(which is better than our previous $k_2$ limit). The proposed j is the best for
high k values, but smaller values are better for lower k values. In particular,
we may be interested in obtaining the sublinearity limit for this filter. The
first three terms put an upper bound on j, the strictest one being

TTl 

^ ~ d{log^ m - log„ d) + 

and using this maximum j value the fourth term gives us the maximum k that 
allows sublinear search time: 



< K = 



(d(log^ TO - log^ d)Y 



- (1 + 0 ( 1 )) 



which is slightly better than our previous $k_0$ limit.

7.2 Adapting the Filter to Substitutions

The problem of searching a pattern allowing k substitutions is much simpler, and
we can apply our machinery to that case as well. A brute force search algorithm
checks any possible text position until it finds k mismatches. Being a geometric
process, this occurs after O(k) character comparisons, which makes the total
search cost $O(kn^d)$ on average.
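
The brute force search just described can be sketched as follows for the two-dimensional case. This is a minimal illustration under stated assumptions: text and pattern are given as lists of equal-length strings, a position is abandoned as soon as more than k mismatches are seen, and the function name `brute_force_substitutions` is hypothetical.

```python
from typing import List, Sequence, Tuple


def brute_force_substitutions(
    text: Sequence[str], pattern: Sequence[str], k: int
) -> List[Tuple[int, int]]:
    """Return top-left corners (row, col) where pattern matches text
    with at most k substituted characters (2-D case, rectangular inputs)."""
    n_rows, n_cols = len(text), len(text[0])
    m_rows, m_cols = len(pattern), len(pattern[0])
    hits = []
    for r in range(n_rows - m_rows + 1):
        for c in range(n_cols - m_cols + 1):
            mismatches = 0
            for dr in range(m_rows):
                for dc in range(m_cols):
                    if text[r + dr][c + dc] != pattern[dr][dc]:
                        mismatches += 1
                        if mismatches > k:
                            break  # too many errors: stop scanning this row
                if mismatches > k:
                    break  # abandon this text position early
            if mismatches <= k:
                hits.append((r, c))
    return hits
```

The early exit is what makes the expected number of comparisons per text position depend on k rather than on the pattern size.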

The same filter proposed in this section works for the case of k substitutions,
the only difference being that in this case the cost to verify a candidate text
position is O(k), i.e. much cheaper. The search cost still has three terms, the
first one being dominant for $k < k_0$. The second component is now dominant for

1 

k < K = — ^(1 + 0 ( 1 )) 

(dlog^m) 1 

and the last one dominates for $k > k_1'$. This filter is sublinear (i.e. does
not inspect all the text characters) on average for $k < k_0$ as before. On the
other hand, it turns out to be better than brute force (i.e. $O(kn^d)$) for
$k < k_1'$, i.e. before the verification step dominates the search cost.

8 Conclusions 

We have presented the first algorithms to search a multidimensional pattern 
in multidimensional text allowing editing errors along any dimension. This is 
a new model recently proposed in |S|. We have generalized to d dimensions 
their algorithm to compute edit distance, where we obtained $O(d!\, m^{2d})$
time and $O(d!\, m^{2d-1})$ space (where the compared elements are of size $m^d$).

We have obtained and proved the correctness of the first search algorithm
for this model, where a pattern of size $m^d$ is searched in a text of size
$n^d$ in $O(d!\, m^d n^d)$ time and $O(d!\, m^d n^{d-1})$ space. We have shown
how to trade time for space, for instance with $O(d!\, 3^d m^{2d-1})$ space we
have $O(d!\, 3^d m^d n^d)$ time.

Finally, we have proposed a filter which obtains roughly 0{kn‘^ (i.e. 
sublinear) average search time for k < {m/{d{log^m — log^ d)))‘^“^, where a is 
the alphabet size. After that error level the filter changes its cost but remains 
better than dynamic programming for k < / {d{log^ m — log^ 

For instance, in two dimensions the filter is sublinear for k < m/(21og^ m) and 
better than dynamic programming for k < mj log^, m. 

These are the first search algorithms and fast filters for the first model which 
extends successfully the concept of approximate string matching to more than 
one dimension. Although we present the algorithms for square d-dimensional 
pattern and text, they also work for hyper-rectangular elements. 

Our work is a (very preliminary) step towards presenting a combinatorial 
alternative to the current image processing technology. However, for this to be 
successful, we must allow not only errors but also rotations, scalings and defor- 
mations in the images. There are some works addressing those issues separately 
pmn, but they have not been merged. We are currently working on this inte- 
gration. 
1 Introduction

RNAs (ribonucleic acids) play an important role when organisms reproduce
themselves. RNAs are single-stranded; however, they tend to form higher-order
structures such as secondary or tertiary structures by folding onto themselves.
It is the RNA structures that determine the functions of RNA sequences. Since
it is very difficult to crystallize and/or get nuclear magnetic resonance spectrum
data for large RNA molecules, reliable methods to determine RNA structures
from the primary sequences are important. An important step toward the
determination of RNA structure is the prediction of RNA secondary structures.
Based on a reliable RNA secondary structure, possible tertiary interactions that
occur between secondary structural elements, and between these elements and
single-stranded regions, can be characterized. Thermodynamic stability methods
have been developed to fold a single RNA into secondary structures with minimum
or near-minimum energy, with some success. Phylogenetic comparative methods,
which try to determine the common secondary structures from a set of RNA
sequences by checking a large number of possible base pairings for their possible
conservation, are more successful. However, this approach is very tedious since
it is basically performed manually. In this abstract, we propose a dynamic
programming algorithm that tries to automate the phylogenetic comparative
process. Given three RNA sequences, we first apply the folding algorithms to each
sequence to determine the frequently recurring stems, which are considered to be
thermodynamically favourable. We then apply our algorithm to the three stem
lists generated by the folding algorithm to determine the common secondary
structures. We have applied our method to three viruses: coxsackievirus, human
rhinovirus (type 14), and poliovirus (type 3). Our method successfully produced
the main components of the common secondary structures of these viruses.

2 Notations 

An RNA molecule is made up of a long chain of subunits (ribonucleotides) linking 
together. Each ribonucleotide contains one of four possible bases, A (adenine), 

* Research supported partially by the Natural Sciences and Engineering Research 
Council of Canada under Grant No. OGP0046373. 



M. Crochemore, M. Paterson (Eds.): CPM’99, LNCS 1645, pp. 258-|2Sni 1999. 
(c) Springer- Verlag Berlin Heidelberg 1999 



Finding Common RNA Secondary Structures from RNA Sequences 259 



C (cytosine), G (guanine), and U (uracil). Thus an RNA molecule is uniquely
determined by its sequence of bases. RNAs fold by intramolecular base pairing.
RNA secondary structures are stabilized by the hydrogen bonds that result from
these base pairings. In addition, the stacking of base pairs in a helix stabilizes
the molecule and decreases the free energy of the folded structure. However, the
appearance of loops destabilizes the RNA structure. In an RNA structure, base
pairs will usually be formed as one of three kinds: G-C, A-U and G-U. There
are three hydrogen bonds between G-C, two between A-U, and one between
G-U.

It is clear that the RNA secondary structure is much more complicated than
the RNA primary structure. Given an RNA $R = a_1 a_2 \ldots a_n$, we use
$a_i \cdot a_j$ to denote a base pair between $a_i$ and $a_j$ where
$1 \le i < j \le n$. Following tradition, we will refer to the first base $a_i$
as the 5' end of the pair and the second base $a_j$ as the 3' end of the pair.
Formally we say S is a secondary structure if it satisfies the following
conditions.

1. If S contains $a_i \cdot a_j$, then $a_i$ and $a_j$ are either A and U, U and A, C and
G, G and C, G and U, or U and G.

2. If S contains $a_i \cdot a_j$, then it cannot contain $a_i \cdot a_k$ (with $k \ne j$) or
$a_k \cdot a_j$ (with $k \ne i$). (one-to-one)

3. If $h < i < j < k$, then S cannot contain both $a_h \cdot a_j$ and $a_i \cdot a_k$. (non-crossing)

4. If S contains $a_i \cdot a_j$, then $|j - i| > 4$.
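
The four conditions above can be checked mechanically. The sketch below is only an illustration: the function name `is_secondary_structure`, the representation of S as a collection of 1-based `(i, j)` pairs, and the `min_span` parameter (set to 4 to mirror condition 4 as printed) are assumptions made for this example.

```python
from typing import Iterable, Tuple

# Base pairs permitted by condition 1.
ALLOWED_PAIRS = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C"), ("G", "U"), ("U", "G")}


def is_secondary_structure(
    seq: str, pairs: Iterable[Tuple[int, int]], min_span: int = 4
) -> bool:
    """Check conditions 1-4 for a set of 1-based base pairs (i, j) on seq."""
    plist = sorted(pairs)
    used = set()
    for i, j in plist:
        if not (1 <= i < j <= len(seq)):
            return False
        if (seq[i - 1], seq[j - 1]) not in ALLOWED_PAIRS:  # condition 1
            return False
        if i in used or j in used:  # condition 2 (one-to-one)
            return False
        used.update((i, j))
        if not (j - i > min_span):  # condition 4, as printed in the text
            return False
    # Condition 3 (non-crossing): no h < i < j < k with pairs (h, j) and (i, k).
    for a in range(len(plist)):
        for b in range(a + 1, len(plist)):
            h, j = plist[a]
            i, k = plist[b]
            if h < i < j < k:
                return False
    return True
```

The quadratic crossing check keeps the example short; a stack-based scan would do the same job in linear time after sorting.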

RNA secondary structure can be decomposed into five kinds of substructures,
namely stems, hairpin loops, bulge loops, interior loops and multiple loops. If S
contains $a_i \cdot a_j$, $a_{i+1} \cdot a_{j-1}$, ..., $a_{i+h-1} \cdot a_{j-h+1}$,
we say each of these pairs (except the last) is stacked on the following pair.
We refer to these consecutive pairs as a stacked pair or as a stem and denote it
as (i, j, h). An RNA secondary structure is determined by its set of stems since
this set is just another representation of its set of base pairs.

We now consider the relationship between two stems $I = (i_1, j_1, h_1)$ and
$II = (i_2, j_2, h_2)$ from a stem list (not necessarily a secondary structure), see
Figures 1 and 2. In Figures 1 and 2 we use the so-called Domes Representation.

Fig. 1. Crossing stems

Fig. 2. Non-crossing stems

We say that stem I and stem II are crossing if and only if
$i_1 < i_2 < j_1 < j_2$. Figure 1 shows crossing stems. We say stem I is before
stem II or stem II is after stem I if and only if $j_1 < i_2$, see Figure 2. We
say stem I is outside stem II or stem II is inside stem I if and only if
$i_1 + h_1 < i_2 < j_2 < j_1 - h_1$, see Figure 2.



3 Algorithm 



3.1 Definitions 



Since our algorithm will deal with three stem lists generated by the folding
algorithm, we now consider three stem lists S, T, and U.

Given a triple (s, t, u), where $s = (i_1, j_1, h_1)$ is from S,
$t = (i_2, j_2, h_2)$ is from T, and $u = (i_3, j_3, h_3)$ is from U, we define
$score(s, t, u) = \min\{h_1, h_2, h_3\}$.

Given two triples $(s_1, t_1, u_1)$ and $(s_2, t_2, u_2)$, where $s_1$ and $s_2$
are from stem list S, $t_1$ and $t_2$ are from stem list T, and $u_1$ and $u_2$
are from stem list U, we say $(s_1, t_1, u_1)$ and $(s_2, t_2, u_2)$ are
compatible if and only if

1. $s_1$ and $s_2$ are not crossing;

2. $t_1$ and $t_2$ are not crossing;

3. $u_1$ and $u_2$ are not crossing;

4. $s_1$ and $s_2$ in S, $t_1$ and $t_2$ in T, and $u_1$ and $u_2$ in U have the same relationship.
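
The stem relationships from Section 2, the score, and the compatibility test can be sketched together. This is an illustrative reading of the definitions, not the authors' code: stems are `(i, j, h)` tuples, the names `relation`, `score` and `compatible` are chosen for this example, and the crossing test is applied symmetrically (the text states it only for stem I starting first), which is an assumption.

```python
from typing import Tuple

Stem = Tuple[int, int, int]  # (5' end i, 3' end j, number of stacked pairs h)


def relation(a: Stem, b: Stem) -> str:
    """Relationship of stem a to stem b: crossing, before, after, outside, inside, or other."""
    i1, j1, h1 = a
    i2, j2, h2 = b
    # Assumption: crossing is treated symmetrically in the two stems.
    if i1 < i2 < j1 < j2 or i2 < i1 < j2 < j1:
        return "crossing"
    if j1 < i2:
        return "before"
    if j2 < i1:
        return "after"
    if i1 + h1 < i2 < j2 < j1 - h1:
        return "outside"
    if i2 + h2 < i1 < j1 < j2 - h2:
        return "inside"
    return "other"  # e.g. overlapping stems not covered by the definitions


def score(s: Stem, t: Stem, u: Stem) -> int:
    """score(s, t, u) = min of the three stem lengths."""
    return min(s[2], t[2], u[2])


def compatible(triple1: Tuple[Stem, Stem, Stem], triple2: Tuple[Stem, Stem, Stem]) -> bool:
    """Conditions 1-4: no crossing in any list, and the same relationship in all three."""
    rels = [relation(x, y) for x, y in zip(triple1, triple2)]
    return "crossing" not in rels and rels[0] == rels[1] == rels[2]
```

Returning a label rather than a boolean makes condition 4 a plain equality test across the three lists.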



Given three stem lists S, T and U, our goal is to find a maximal set of
non-crossing stems from each list such that they form the same topological
structure. Formally we define the weight of S, T and U as follows:

$$ weight(S, T, U) = \max_{k_1, k_2, \ldots, k_n} \sum_{i=1}^{n} score(s_{k_i}, t_{k_i}, u_{k_i}) $$

subject to $s_{k_i} \in S$, $t_{k_i} \in T$, $u_{k_i} \in U$, and, for all i and j,
$(s_{k_i}, t_{k_i}, u_{k_i})$ and $(s_{k_j}, t_{k_j}, u_{k_j})$ are compatible.

Suppose that S, T, and U are sorted by the 3' end of the stems. Let $s_i$ be
the ith stem in S, $t_j$ be the jth stem in T, and $u_k$ be the kth stem in U. Let
$S'_i$ be the stem list containing the stems below $s_i$, $T'_j$ be the stem list
containing the stems below $t_j$, and $U'_k$ be the stem list containing the stems
below $u_k$. We define forest_weight(i, j, k) and tree_weight(i, j, k) as follows.

$$ forest\_weight(i, j, k) = weight(\{s_1, \ldots, s_i\}, \{t_1, \ldots, t_j\}, \{u_1, \ldots, u_k\}) $$

$$ tree\_weight(i, j, k) = score(s_i, t_j, u_k) + weight(S'_i, T'_j, U'_k) $$



3.2 Properties 

Lemma 1. Let i' be the largest index such that the 3' end of $s_{i'}$ is less
than the 5' end of $s_i$, j' be the largest index such that the 3' end of
$t_{j'}$ is less than the 5' end of $t_j$, and k' be the largest index such that
the 3' end of $u_{k'}$ is less than the 5' end of $u_k$. Then

$$ forest\_weight(i, j, k) = \max \left\{ \begin{array}{l}
forest\_weight(i - 1, j, k) \\
forest\_weight(i, j - 1, k) \\
forest\_weight(i, j, k - 1) \\
forest\_weight(i', j', k') + tree\_weight(i, j, k)
\end{array} \right. $$

Proof. Consider the best way to match $s_1, \ldots, s_i$, $t_1, \ldots, t_j$, and
$u_1, \ldots, u_k$; there are four possibilities. First, $s_i$ does not match any
stem in T or U, therefore forest_weight(i, j, k) = forest_weight(i - 1, j, k).
Second, $t_j$ does not match any stem in S or U, therefore
forest_weight(i, j, k) = forest_weight(i, j - 1, k). Third, $u_k$ does not match
any stem in S or T, therefore forest_weight(i, j, k) = forest_weight(i, j, k - 1).
Fourth, $s_i$, $t_j$, and $u_k$ are matched up with each other, and we have
forest_weight(i, j, k) = forest_weight(i', j', k') + tree_weight(i, j, k). □



Based on this lemma, we can implement the algorithm by using six nested
loops. The resulting algorithm is reasonable for two stem lists. However, for
three stem lists it is extremely slow. Note that for practical applications we
always consider RNAs with some sequence similarity. This means that we do not
need to consider all the triples. Instead, we only need to consider triples which
are close. This will speed up the algorithm.

We now refine our definition of weight(S, T, U). We introduce one parameter,
end_control, to reduce the number of stems that we need to consider. Given two
stems $s = (i_1, j_1, h_1)$ and $t = (i_2, j_2, h_2)$, we say s and t are
semi-matchable if $|j_1 - j_2| < end\_control$. We say they are matchable if they
are semi-matchable and in addition $|i_1 - i_2| < end\_control$. We use s <> t
to denote this relation.



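$$ weight(S, T, U) = \max_{k_1, k_2, \ldots, k_n} \sum_{i=1}^{n} score(s_{k_i}, t_{k_i}, u_{k_i}) $$

subject to $s_{k_i} \in S$, $t_{k_i} \in T$, $u_{k_i} \in U$; for all i and j,
$(s_{k_i}, t_{k_i}, u_{k_i})$ and $(s_{k_j}, t_{k_j}, u_{k_j})$ are compatible;
and, for any i, $s_{k_i}$ <> $t_{k_i}$, $t_{k_i}$ <> $u_{k_i}$, and $u_{k_i}$ <> $s_{k_i}$.

With this refined definition, the definitions of forest_weight(i, j, k) and
tree_weight(i, j, k) remain the same as before.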
Lemma 2. Let i' be the largest index such that the 3' end of $s_{i'}$ is less
than the 5' end of $s_i$, j' be the largest index such that the 3' end of
$t_{j'}$ is less than the 5' end of $t_j$, and k' be the largest index such that
the 3' end of $u_{k'}$ is less than the 5' end of $u_k$. Then

$$ forest\_weight(i, j, k) = \max \left\{ \begin{array}{l}
forest\_weight(i - 1, j, k) \\
forest\_weight(i, j - 1, k) \\
forest\_weight(i, j, k - 1) \\
forest\_weight(i', j', k') + tree\_weight(i, j, k) \quad \text{if } s_i <> t_j,\ t_j <> u_k, \text{ and } u_k <> s_i
\end{array} \right. $$

Proof. The proof is exactly the same as that of Lemma 1 except that in order
for $s_i$, $t_j$, and $u_k$ to match they have to satisfy the condition that
$s_i$ <> $t_j$, $t_j$ <> $u_k$, and $u_k$ <> $s_i$. □

From this lemma we know that we only need to compute tree_weight(i, j, k)
in case $s_i$ <> $t_j$, $t_j$ <> $u_k$, and $u_k$ <> $s_i$.

For stem $s_i$ in stem list S, consider its semi-matchable stems in stem list T.
Since T is sorted, there is an interval [s, e] such that $t_j$ is semi-matchable
with $s_i$ if and only if $s \le j \le e$.

For any stem $s_i$ in S, let $s_T(i)$ and $e_T(i)$ be the starting and ending
indices of $s_i$'s semi-matchable stems in stem list T. Similarly we can define
$s_U(i)$, $e_U(i)$, $e_S(j)$, and $e_S(k)$.
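
The matchable relation and the interval of semi-matchable stems can be sketched as follows. This is an assumption-laden illustration: stems are `(i, j, h)` tuples with integer positions, the list T is already sorted by 3' end, the strict inequality follows the definition as printed, and the function names are invented for this example. Binary search from Python's standard `bisect` module is used to locate the interval.

```python
from bisect import bisect_left, bisect_right
from typing import List, Tuple

Stem = Tuple[int, int, int]  # (5' end i, 3' end j, length h)


def semi_matchable(s: Stem, t: Stem, end_control: int) -> bool:
    """3' ends differ by less than end_control."""
    return abs(s[1] - t[1]) < end_control


def matchable(s: Stem, t: Stem, end_control: int) -> bool:
    """Semi-matchable and 5' ends also differ by less than end_control (s <> t)."""
    return semi_matchable(s, t, end_control) and abs(s[0] - t[0]) < end_control


def semi_matchable_interval(s: Stem, T: List[Stem], end_control: int) -> Tuple[int, int]:
    """1-based inclusive interval (start, end) of stems in T semi-matchable with s.

    T must be sorted by 3' end. The interval is empty when start > end.
    """
    ends = [t[1] for t in T]
    start = bisect_right(ends, s[1] - end_control) + 1  # first 3' end > j - end_control
    end = bisect_left(ends, s[1] + end_control)  # last 3' end < j + end_control
    return start, end
```

Because T is sorted by 3' end, the semi-matchable stems always form one contiguous run, which is why two binary searches are enough.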

Lemma 3.

forest_weight(i, j, k) = forest_weight(i, j - 1, k) if $j > e_T(i)$

forest_weight(i, j, k) = forest_weight(i, j, k - 1) if $k > e_U(i)$

forest_weight(i, j, k) = forest_weight(i - 1, j, k) if $j < s_T(i)$ or $k < s_U(i)$

Proof. When $k > e_U(i)$, $u_k$ is useless since it cannot match any $s_l$ where
$1 \le l \le i$. Therefore forest_weight(i, j, k) = forest_weight(i, j, k - 1).
Similarly, we can prove that if $j > e_T(i)$, then
forest_weight(i, j, k) = forest_weight(i, j - 1, k). If $j < s_T(i)$ or
$k < s_U(i)$, then $s_i$ is useless since either it cannot match any $t_l$ where
$1 \le l \le j$ or it cannot match any $u_l$ where $1 \le l \le k$. Therefore
forest_weight(i, j, k) = forest_weight(i - 1, j, k). □

From this lemma, we know that for each $s_i$ in stem list S, we only need to
compute forest_weight(i, j, k) such that $s_T(i) \le j \le e_T(i)$ and
$s_U(i) \le k \le e_U(i)$.

Lemma 4.

If $j > e_T(i)$ and $s_U(i) \le k \le e_U(i)$, then

forest_weight(i, j, k) = forest_weight(i, $e_T(i)$, k).

If $s_T(i) \le j \le e_T(i)$ and $k > e_U(i)$, then

forest_weight(i, j, k) = forest_weight(i, j, $e_U(i)$).

If $j > e_T(i)$ and $k > e_U(i)$, then

forest_weight(i, j, k) = forest_weight(i, $e_T(i)$, $e_U(i)$).

Proof. We can prove this lemma by applying Lemma 3 repeatedly. □

Lemma 5.

If $j < s_T(i)$ and $k \ge s_U(i)$, then

forest_weight(i, j, k) = forest_weight($e_S(j)$, j, k).

If $j \ge s_T(i)$ and $k < s_U(i)$, then

forest_weight(i, j, k) = forest_weight($e_S(k)$, j, k).

If $j < s_T(i)$ and $k < s_U(i)$, then

forest_weight(i, j, k) = forest_weight($\min\{e_S(j), e_S(k)\}$, j, k).

Proof. We can prove this lemma by applying Lemma 3 repeatedly. □

Lemma 6. If $s_T(i) \le j$ and $s_U(i) \le k$, then

forest_weight(i, j, k) = forest_weight(i, $\min\{j, e_T(i)\}$, $\min\{k, e_U(i)\}$).

Proof. Immediately from Lemma 4. □

Lemma 7. If $j < s_T(i)$ or $k < s_U(i)$, let $i_1 = \min\{e_S(j), e_S(k)\}$, then

forest_weight(i, j, k) = forest_weight($i_1$, $\min\{j, e_T(i_1)\}$, $\min\{k, e_U(i_1)\}$).

Proof. If $k \ge s_U(i)$, then $e_S(k) \ge i$, and if $j \ge s_T(i)$, then
$e_S(j) \ge i$. Therefore by Lemma 5,
forest_weight(i, j, k) = forest_weight($\min\{e_S(j), e_S(k)\}$, j, k). Let
$i_1 = \min\{e_S(j), e_S(k)\}$; by Lemma 6,
forest_weight(i, j, k) = forest_weight($i_1$, $\min\{j, e_T(i_1)\}$, $\min\{k, e_U(i_1)\}$). □

3.3 Algorithm 

The algorithm works as follows: 

- For each triple of stems $(s_i, t_j, u_k)$ that are matchable, calculate
tree_weight(i, j, k);

- Compute weight(S, T, U).

- Trace back to collect the matchable stems in each stem list that contribute
to weight(S, T, U).

The algorithm to calculate tree_weight(i, j, k) and weight(S, T, U) is given
in Figure 3.
Input: Three stem lists S, T, and U.

Output: weight(S, T, U).

compute s_T(i) and e_T(i), 1 <= i <= |S|;
compute s_U(i) and e_U(i), 1 <= i <= |S|;
compute e_S(j), 1 <= j <= |T|;
compute e_S(k), 1 <= k <= |U|;

compute before_stem(S, i), 1 <= i <= |S|;
compute before_stem(T, j), 1 <= j <= |T|;
compute before_stem(U, k), 1 <= k <= |U|;

for i := 1 to |S| do
    ii = before_stem(S, i);
    for j := s_T(i) to e_T(i) do
        jj = before_stem(T, j);
        for k := s_U(i) to e_U(i) do
            kk = before_stem(U, k);
            (i', j', k') = adjust(i - 1, j, k);
            m[0] = forest_weight(i', j', k');
            (i', j', k') = adjust(i, j - 1, k);
            m[1] = forest_weight(i', j', k');
            (i', j', k') = adjust(i, j, k - 1);
            m[2] = forest_weight(i', j', k');
            m[3] = 0;
            if s_i <> t_j and t_j <> u_k and u_k <> s_i then
                (i', j', k') = adjust(ii, jj, kk);
                m[3] = forest_weight(i', j', k') + tree_weight[i][j][k];
            forest_weight[i][j][k] = max(m[0], m[1], m[2], m[3])
return forest_weight(|S|, |T|, |U|);

Fig. 3. Procedure: Computing weight(S, T, U)



We first compute $s_T(i)$, $e_T(i)$, $s_U(i)$, $e_U(i)$, $e_S(j)$, and
$e_S(k)$. This step can be done in linear time.

We then compute before_stem(S, i), which denotes the largest indexed stem
in S whose 3' end is less than $s_i$'s 5' end. We also compute before_stem(T, j)
and before_stem(U, k). After sorting, this step can be done in linear time.
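
A possible way to compute before_stem for a whole list is sketched below. It is not the linear-time method mentioned in the text: for brevity it uses one binary search per stem (O(n log n) overall), which is a swapped-in technique. Stems are assumed to be `(i, j, h)` tuples sorted by 3' end, and the function name `before_stem_table` is hypothetical.

```python
from bisect import bisect_left
from typing import List, Tuple

Stem = Tuple[int, int, int]  # (5' end i, 3' end j, length h)


def before_stem_table(stems: List[Stem]) -> List[int]:
    """For each stem (list sorted by 3' end), return the largest 1-based index
    of a stem whose 3' end is less than this stem's 5' end (0 if none)."""
    three_prime_ends = [j for (_, j, _) in stems]
    table = []
    for five_prime, _, _ in stems:
        # Number of 3' ends strictly below the 5' end equals the wanted index.
        table.append(bisect_left(three_prime_ends, five_prime))
    return table
```

An index of 0 plays the role of "no stem lies entirely before this one", so the corresponding forest_weight entry can be treated as 0.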

The main part of the algorithm is three nested loops. For any given i, where
$1 \le i \le |S|$, we only calculate forest_weight(i, j, k) for
$s_T(i) \le j \le e_T(i)$ and $s_U(i) \le k \le e_U(i)$. When calculating
forest_weight(i, j, k), we may refer to locations which are outside our
calculating range. In this situation, we need to adjust the indices by using
Lemma 6 or Lemma 7.

Let $n_T$ be the maximum number of stems in T that are semi-matchable with
a single stem in S. Let $n_U$ be the maximum number of stems in U that are
semi-matchable with a single stem in S. The time complexity of the algorithm
in Figure 3 is bounded by $O(|S| \times n_T \times n_U)$. The total time
complexity of our algorithm is $O(|S|^2 \times n_T^2 \times n_U^2)$. With the
end_control parameter, we can control $n_T$ and $n_U$.

4 Experiment Results 

In our experiments, we used three viruses: coxsackievirus, human rhinovirus
(type 14) and poliovirus (type 3) 0. We call them CVB, HRV, and POL. These 
viruses are believed to belong to the same family, and therefore they should have 
common secondary structures. 

We first apply the folding algorithms jH| for each sequence to determine 
the frequently recurring stems which are considered to be thermodynamically 
favourable. We then apply our algorithm to the stem lists generated to determine 
the common secondary structures. 

The results shown in Figures 4, 5, and 6 are obtained by setting the parameter
end3-control to 20. RNA secondary structure display is provided by
StructureLab from the U.S. National Cancer Institute.

Compared with the published results in |2| using phylogenetic comparative 
methods, our algorithm produced main components of the common secondary 
structures. 

From base 1 to base 89, our results are almost the same as those in |2|. 

From base 241 to base 441, the shape is also almost the same except for
some missing short stems, which are missing from the input stem lists as well.
These short stems can be generated by running the folding algorithm again with
the stems we have found fixed. This will make the large internal loops disappear.

From base 445 to base 560, we also got the same shape. 

The substructure from base 90 to base 240 differs considerably. Again, this 
is caused by the lack of appropriate input stems. 

Note that using the folding algorithm alone cannot produce the correct secondary
structure model. In our experiment, for each sequence, the folding algorithm
generated thousands of secondary structures, none of which is close to the
correct model.

In conclusion, together with the folding algorithms, our algorithm can produce
the main components of common RNA secondary structures from RNA sequences.
Based on these components, a more accurate model can be developed.

5 Conclusion and Future Work 

We present an algorithm which produced reasonable models for common
secondary structures from three RNA sequences.

We are currently improving our algorithms. The score between three stems
is too simple to be realistic. We plan to change it into a more meaningful measure.
We would also like to do a preprocessing step to check the situation where a stem
has a matchable stem in the second list but not in the third list. In this case
we may want to check the output from the folding algorithm to see if some
corresponding stem exists with lower frequency. These stems can be added to the
third stem list.




Fig. 4. Figure of CVB 
Fig. 5. Figure of HRV

Abstract. This paper examines the complexity of comparing sequences
that have arcs linking symbol pairs. Such arc-annotated sequences can
represent molecular sequences with bonds between bases, such as RNA
sequences. Crossing arcs that can represent sequence pseudoknots are
included. The problem of finding the longest common subsequence, on
which pairwise sequence comparison algorithms are frequently based, is
modified to require common subsequences to preserve the arcs induced
by the selected symbol positions. This problem is then analyzed using
classical and parameterized complexity. It is shown to be NP-complete,
and also W[1]-complete when parameterized by desired length of common
subsequence. If it is parameterized instead by arc cutwidth k, however,
it becomes fixed-parameter tractable, and usable for sequences with arc
structures of limited cutwidth. An algorithm is given that runs in time
$O(9^k nm)$.

1 Introduction 

Genetic and protein sequence similarity can indicate evolutionary similarity and 
some functional similarity between the sequences. One common way to measure 
the similarity of two sequences is pairwise sequence alignment, a method of com- 
parison based on the longest common subsequence algorithm (see 0 and 0). 
Arcs that link bases within a sequence can be used to indicate secondary struc- 
ture of molecular sequences by representing molecular bonds and links between 
pieces of the molecule’s structure. These arcs can be incorporated into sequence 
comparison to produce an overall measure of similarity between sequences an- 
notated with arcs. 

Previous work on aligning arc-annotated sequences has focused on RNA se- 
quences, where the arcs represent bonds. Matched arcs are used to enhance or 
guide the sequence alignment and improve its similarity score. Structures with 
pseudoknots, which require the representative arcs to cross, are usually excluded. 
Early work on RNA alignment involved predicting common secondary structure 
while aligning the sequences 0 . Corpet and Michot 0 produced an algorithm
that aligns a new sequence with a bank of aligned sequences, matching the new 
sequence to the bank while also preserving as much as possible of the common 
secondary structure of the sequence bank. This algorithm runs in time G 0(n^) 



M. Crochemore, M. Paterson (Eds.): CPM’99, LNCS 1645, pp. 270-|2Sni 1999. 
(c) Springer- Verlag Berlin Heidelberg 1999 



Finding Common Subsequences with Arcs and Pseudoknots 271 



for sequence length n. The algorithm of Bafna et al. Q aligns two sequences with 
nested arcs only, and uses weights for both sequence and arc matching. However, 
it does not detect mismatches for the arcs, and ignores the arc information if 
it does not improve the score. It also has worst-case time complexity $O(n^2 m^2)$,
where n and m are the sequence lengths. For long sequences, this time complexity 
can be too high. This time complexity is also independent of the complexity and 
depth of the arc structure, and thus could not exploit this structure to reduce the 
time required. Lenhof et al. do include pseudoknots in their graph-based work on 
RNA sequence alignment 0 . Their algorithm, however, aligns sequences where 
only one sequence has an associated structure. Like the work before it, the links 
between base pairs are used to enhance the alignment. 

This paper examines the problem of finding the longest common subsequence 
of a pair of arc-annotated sequences. The common subsequence must not only 
match both sequences, but also preserve all arcs that link subsequence sym- 
bols. This analysis specifically looks into the complexity of solving this problem 
when the arcs can cross; pseudoknots in sequences can thus be represented. 
Different parameters of the problem are examined, determining the conditions 
for usable or effective computation of the arc-preserving longest common
subsequence. This problem is proved to be NP-complete, and it is also
W[1]-complete when parameterized by the desired length of common subsequence.
If a bound on the arc structure's cutwidth, the number of arcs that cross any
position, is used as the parameter instead, the problem is fixed-parameter
tractable. An algorithm is given for this variant that finds the length of the
longest common subsequence in $O(9^k nm)$ time for cutwidth k and input sequence
lengths n and m. For $k < \log_9(\max(n, m))$, this algorithm's time complexity
is less than that of earlier work.

2 Problem Definition 

The Arc-Preserving Longest Common Subsequence problem for sequences with
crossing arcs is defined as:

Input: target length l and the pair of annotated sequences $(S_1, P_1)$ and $(S_2, P_2)$.

These annotated pairs consist of the sequences $S_1$ and $S_2$ over some fixed
alphabet $\Sigma$, with arc annotations $P_1 \subseteq \{1, \ldots, |S_1|\}^2$
and $P_2 \subseteq \{1, \ldots, |S_2|\}^2$.

Each set of arcs $P_a$ is further restricted so that $\forall (i_1, i_2)$ and
$(i_3, i_4) \in P_a$, $i_1 = i_3$ if and only if $i_2 = i_4$, and
$i_2 \ne i_3$. These conditions require that each sequence position can be an
endpoint for at most one arc; it is linked to at most one other position. The
length of $S_1$ is n and the length of $S_2$ is m.

Output: returns true if and only if there is some mapping
$MS \subseteq \{1, \ldots, |S_1|\} \times \{1, \ldots, |S_2|\}$ between the
positions of $S_1$ and $S_2$ such that $|MS| = l$ and

1. the mapping is one-to-one and preserves the order of the subsequence:

$\forall (i_1, j_1) \in MS$ and $(i_2, j_2) \in MS$,

$i_1 = i_2$ if and only if $j_1 = j_2$,
$i_1 < i_2$ if and only if $j_1 < j_2$;

2. the arcs induced by the mapping are preserved:

$\forall (i_1, j_1) \in MS$ and $(i_2, j_2) \in MS$, $(i_1, i_2) \in P_1$ if and only if $(j_1, j_2) \in P_2$;

3. the mapping produces a common subsequence:

$\forall (i, j) \in MS$, $S_1[i] = S_2[j]$.

If any of these conditions is not met, false is returned.
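
The three conditions can be verified directly for a candidate mapping. The sketch below only checks a given mapping; it does not search for one. The function name `is_arc_preserving_mapping`, the 1-based positions, and the representation of arcs as ordered pairs in Python sets are assumptions for this illustration.

```python
from typing import Iterable, Set, Tuple

Pair = Tuple[int, int]


def is_arc_preserving_mapping(
    s1: str, p1: Set[Pair], s2: str, p2: Set[Pair], mapping: Iterable[Pair], length: int
) -> bool:
    """Check that mapping (pairs of 1-based positions in s1 and s2) has the
    target size and satisfies conditions 1-3 of the problem definition."""
    ms = list(mapping)
    if len(ms) != length or len(set(ms)) != length:
        return False
    for i, j in ms:
        if not (1 <= i <= len(s1) and 1 <= j <= len(s2)):
            return False
        if s1[i - 1] != s2[j - 1]:  # condition 3: common subsequence
            return False
    for i1, j1 in ms:
        for i2, j2 in ms:
            if (i1 == i2) != (j1 == j2):  # condition 1: one-to-one
                return False
            if (i1 < i2) != (j1 < j2):  # condition 1: order preserved
                return False
            if ((i1, i2) in p1) != ((j1, j2) in p2):  # condition 2: arcs preserved
                return False
    return True
```

The double loop visits every ordered pair of mapped positions, so an arc present in only one of the two sequences is always detected.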

This problem is slightly different from those addressed by others. The preser- 
vation of the arcs induced by the subsequence is enforced. The algorithm of Bafna 
et al. m instead allows the arcs to be ignored if they do not contribute posi- 
tively to the alignment score. The arcs in that case are used to enhance the 
alignment and its value, but do not control it; mismatches between arcs are not 
detected, and are thus disregarded. Extensions of the algorithm in this paper can 
include weights for symbols and arcs, and the arc weights can be both positive 
and negative. Negative arc weights are only possible because the information 
represented by the arcs is not ignored by the algorithm. This additional feature 
enables the alignment parameters to be adjusted so that the weight of matched 
pairs does not overwhelm the entire alignment. The only restriction on this use 
of reducing weights is that the reduction needs to be no greater than the smaller 
of the two endpoint weights; otherwise, the endpoint will not be matched at all. 
Checking for arc mismatches, as this algorithm does, also shows the difference 
between matching unbonded bases, and matching a bonded with an unbonded 
base. Previous work does not allow for this distinction. Furthermore, previous
work does not allow crossing arcs (or limits them to one sequence only), and
therefore excludes sequences with pseudoknots from these comparisons.



3 Hardness Results 

This problem can be analyzed using the techniques from both classical and
parameterized complexity. For classical complexity, it is proved to be NP-complete.
For parameterized complexity, which can show the effect of specific parameters
on the problem's complexity, it is W[1]-complete when the desired subsequence
length $l$ is used as the parameter.

The parameterized complexity hierarchy [3] is composed of the classes

$FPT \subseteq W[1] \subseteq W[2] \subseteq \cdots \subseteq W[t] \subseteq \cdots \subseteq W[SAT] \subseteq \cdots \subseteq W[P]$



and uses parametric reductions from problem A to problem B that require the
parameter of B to be a function only of the parameter of A, independent of
problem size. Clique is one problem that is complete for the class W[1] (shown

in PI). 

Lemma. Clique is polynomially reducible and strongly uniformly parametri- 
cally reducible to Arc-Preserving LCS. 

Reduction. From $k$-Clique of a graph $G = (V,E)$, where $V = \{1,2,\ldots,n\}$.

[Figure 1 shows a graph encoded as the sequence baaaaaabbaaaaaabbaaaaaabbaaaaaabbaaaaaabbaaaaaab, a 3-clique encoded as baaabbaaabbaaab, and their LCS alignment.]

Fig. 1. Example of transformation from Clique to Arc-Preserving LCS.



Construct $S_1[1..(n^2 + 2n)]$ and $P_1$ as follows:

$S_1 = (ba^nb)^n$

$P_1 = \{((u-1)(n+2)+1,\ u(n+2)) \mid u \in V\} \cup \{((u-1)(n+2)+v+1,\ (v-1)(n+2)+u+1) \mid (u,v) \in E\}$

Construct $S_2[1..(k^2 + 2k)]$ and $P_2$ as follows:

$S_2 = (ba^kb)^k$

$P_2 = \{((u-1)(k+2)+1,\ u(k+2)) \mid u \in \{1,\ldots,k\}\} \cup \{((u-1)(k+2)+v+1,\ (v-1)(k+2)+u+1) \mid u,v \in \{1,\ldots,k\},\ u \ne v\}$

The parameter is $l = k^2 + 2k$, while the maximum sequence length is $n' = n^2 + 2n$. Since the length of the sequences is bounded by a polynomial in $n$, this
is a polynomial reduction. The sequence parameter $l$ is a polynomial only in $k$
and is independent of $n$, so this reduction is also a parameterized reduction.

The target subsequence size is the same as the length of the second sequence,
so we are really asking if the arc sequence $S_2$ is a subsequence of the arc sequence
$S_1$. Figure 1 illustrates an example of this reduction.
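
The construction can be sketched in code. This is an illustrative sketch only: the function names `build_instance` and `clique_mapping` are hypothetical, edges are assumed to be given as pairs of vertices numbered 1..n, and each edge arc is stored with its smaller endpoint first. The arc positions assume the segment layout $ba^nb$ with the $v$-th a of segment $u$ linked to the $u$-th a of segment $v$.

```python
def build_instance(n, edges, k):
    """Return (s1, p1, s2, p2, l) for a k-Clique instance on vertices 1..n."""
    s1 = ("b" + "a" * n + "b") * n
    # One arc joining the two b symbols of every segment of S1.
    p1 = {((u - 1) * (n + 2) + 1, u * (n + 2)) for u in range(1, n + 1)}
    for u, v in edges:
        u, v = min(u, v), max(u, v)
        # v-th a of segment u is linked to the u-th a of segment v.
        p1.add(((u - 1) * (n + 2) + v + 1, (v - 1) * (n + 2) + u + 1))
    s2 = ("b" + "a" * k + "b") * k
    p2 = {((u - 1) * (k + 2) + 1, u * (k + 2)) for u in range(1, k + 1)}
    for u in range(1, k + 1):
        for v in range(u + 1, k + 1):
            p2.add(((u - 1) * (k + 2) + v + 1, (v - 1) * (k + 2) + u + 1))
    return s1, p1, s2, p2, k * k + 2 * k

def clique_mapping(n, k, clique):
    """Map segment t of S2 onto the S1 segment of the t-th clique vertex."""
    c = sorted(clique)
    mapping = set()
    for t in range(1, k + 1):
        base1 = (c[t - 1] - 1) * (n + 2)
        base2 = (t - 1) * (k + 2)
        mapping.add((base1 + 1, base2 + 1))              # opening b
        for w in range(1, k + 1):                        # w-th a of the S2 segment
            mapping.add((base1 + c[w - 1] + 1, base2 + w + 1))
        mapping.add((base1 + n + 2, base2 + k + 2))      # closing b
    return mapping
```

For a graph on 4 vertices containing the triangle {1, 2, 3}, `clique_mapping(4, 3, {1, 2, 3})` yields 15 matched positions, which equals $k^2 + 2k$ for $k = 3$.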






Patricia A. Evans 



Proof Sketch. We need to show that the graph $G = (V, E)$ has a clique of size
$k$ if and only if $(S_1, P_1)$ and $(S_2, P_2)$ have an arc-preserving common subsequence
of length $l = k^2 + 2k$.

Let $V' \subseteq V$ be a clique in $G$ of size $k$. Each vertex $u \in V'$ corresponds
to a segment of $S_1$, $ba^nb$, and a segment of $S_2$, $ba^kb$. If this pair of segments
is matched, it will contribute $k + 2$ to the length of a common subsequence.
The pair of b symbols are joined by an arc in both segments, so those arcs are
preserved. The a symbols in each segment are linked, in order, to the a symbols
in the other segments that correspond to vertices from the clique $V'$. Thus $k$
of the a symbols of $S_1$, those linked to the other selected segments (plus one
for the segment itself), can also be matched to the corresponding segment of $S_2$
while preserving the arcs. The common subsequence is the concatenation of the
segment pairs' common subsequences, and has length $k(k + 2)$.

Let the annotated sequences $(S_1, P_1)$ and $(S_2, P_2)$ have an arc-preserving
subsequence of length $l = k^2 + 2k$. The linked pairs of b symbols in both sequences
enforce the matching of symbols from only $k$ segments of $S_1$, with exactly $k + 2$
symbols matched from each segment. Of these $k + 2$ symbols, 2 are b
symbols. Of the remaining $k$ symbols, one is not linked, while the others are all
linked to symbols from different segments that are also from the selected set of
$k$ segments. Since these arcs were defined to link segments that corresponded to
vertices in $G$ that were joined by an edge, a pair of vertices is linked if and only
if their corresponding segments are linked. The subsequence must preserve the
arcs from $(S_2, P_2)$ that link all of its $k$ segments. Therefore all selected segments
from $S_1$ are also linked pairwise by arcs, and the set of $k$ corresponding vertices
of the graph $G$ is a clique in $G$. □

Theorem 1. Arc-Preserving LCS is NP-complete. 

Proof: All the requirements for a solution to the Arc-Preserving LCS problem 
can be checked in polynomial time, so it is in NP. It is polynomially reducible 
from Clique by the above Lemma, so it is also NP-hard. Thus Arc-Preserving 
LCS is NP-complete. □ 

Theorem 2. Arc-Preserving LCS is W[1]-complete when parameterized by
desired subsequence length $l$.

Proof: Arc-Preserving LCS, parameterized by $l$, is in W[1] since any instance
of the problem can be converted into a decision circuit with weft 1 whose
accepted input assignments of weight $l$ correspond to the arc-preserving common
subsequences of length $l$. This circuit can be constructed with one input bit for
each possible sequence position match $(i,j)$. Each condition for acceptable input
can be checked with a 2-input gate, and the circuit output can be set to 0 if any
condition is violated, or 1 if all conditions hold.

Clique, which is hard for W[1], is parametrically reducible to Arc-Preserving
LCS (parameterized by $l$) by the reduction in the Lemma given above. Thus
Arc-Preserving LCS is W[1]-complete for the parameter $l$. □

4 Sequences with Bounded Cutwidth

4.1 Necessary Data Structures 

While the problem is NP-complete, and is also W[1]-complete when
parameterized by desired subsequence length $l$, the problem becomes fixed-parameter
tractable if it is instead parameterized by the arc cutwidth. This cutwidth is
defined as the maximum number of arcs that cross or end at any arbitrary
position of the sequence. If both sequences have their cutwidth bounded by some
$k$, the problem can be solved in time $O(f(k)nm)$ where $f(k)$ is a function of
$k$ independent of both $n$ and $m$. Sufficient bounds on $k$ will make the problem
tractable.
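
As a small illustration of the cutwidth definition, the sketch below counts, for every position, the arcs that cross it or end at it. The function name `arc_cutwidth` and the convention that an arc (a, b) with a < b is counted at position p when a < p <= b are assumptions made for this example.

```python
def arc_cutwidth(n, arcs):
    """Maximum, over positions 1..n, of the number of arcs crossing or ending there.

    arcs is a collection of pairs (a, b) with a < b; an arc is counted at
    position p when a < p <= b (it crosses p or has its final endpoint at p).
    """
    best = 0
    for p in range(1, n + 1):
        active = sum(1 for a, b in arcs if a < p <= b)
        best = max(best, active)
    return best
```

Two nested arcs, or two crossing arcs, give cutwidth 2, while two arcs placed one after the other give cutwidth 1.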

On initial examination, it seems that this problem could be solved by splitting
the tables whenever an initial endpoint is encountered. Restricting the cutwidth
of the sequences, then, should make this problem fixed-parameter tractable.
These tables can be used to store the initial endpoint matches that are made 
on the various computation paths, and these paths can then be searched when 
a final endpoint is encountered. This searching can determine if any maximum 
computation path has matched the initial endpoints that correspond to the final 
endpoints encountered. However, this network of paths can, potentially, cover 
all positions in the tables, and may need to be searched completely. 




Fig. 2. Computation Path Network Example. 



The network of computation paths cannot be searched simply using a breadth-first
search technique, nor can we only maintain a list of all arc assignments on
the maximum computation paths that lead to each table entry. Since matching
the final endpoint depends on whether (and to what) the initial endpoint was
matched, the network of paths also includes path combinations that are not
allowed. Symbol matches inside an arc would merge lists that need to be kept
separate. In Figure 2, for example, the entry that matches the two a symbols
brings together two computation paths that match initial endpoint g symbols;
however, the subsequent matching of the final endpoint symbols can only be done
one way for each initial match, not two as the network of paths would indicate.
Looking for an initial match in this network, then, would require searching each
possible path through the different table positions in order to correctly
compute the new table value.

Since searching through the network of paths is very costly, the different valid 
computation paths can instead be kept in a tree. Each of the table positions 
should have its own tree of all valid computation paths. In order to be able to 
minimize the length of the paths in the tree, each position will need to have 
its own copy, instead of referencing trees at previous positions. These paths are 
kept short by removing initial endpoint matches of arcs that are no longer active 
(whose final endpoints have been encountered) . This editing reduces the length of 
each path to at most k, and the storing of the endpoint matches in a tree means 
that all paths to be searched are valid ones. Thus the time to search for an initial 
endpoint match is reduced while both the space used and the complexity of the 
data structure are increased. 



4.2 Bounded Cutwidth Algorithm 

The tree structure outlined above is used in the following algorithm that finds 
the arc-preserving longest common subsequence for arcs with bounded cutwidth. 

Theorem 3. The Arc-Preserving LCS problem, parameterized by cutwidth $k$,
is Fixed-Parameter Tractable, and can be solved in time $O(9^k nm)$.

Proof. A solution for the Arc-Preserving LCS problem is found by the following 
algorithm. 

Algorithm: 

Step 1. For each of $P_1$ and $P_2$, partition the set of arcs into $k$ sets, where each
set contains a chain of arcs that do not cross or nest. Number these chains
from 0 through $k - 1$.

Step 2. For each of $(S_1, P_1)$ and $(S_2, P_2)$, look at each subset of the set of chains.
For each subset of chains of $P_1$, create a copy of the sequence $S_1$ with the
initial endpoints of all arcs in those chains removed and replaced by some
$x \notin \Sigma$. The set of sequences thus created is $\mathcal{S}_1$, and is generally indexed by
$h_1$, where $h_1 = \sum_{i \in \text{subset}} 2^i$. For $(S_2, P_2)$, create the set of sequences $\mathcal{S}_2$ in
the same way, indexing it using $h_2$.

Step 3. For each combination of $h_1$ and $h_2$, create a two-dimensional table, $n \times m$,
that uses strings $\mathcal{S}_1[h_1]$ and $\mathcal{S}_2[h_2]$. These tables will be used to calculate
the length of the longest common subsequence. Each table position includes
both a value $T^{h_1,h_2}[i,j]$, the length of the longest common subsequence
so far, and a tree $M^{h_1,h_2}[i,j]$ of the initial arc endpoints matched along
the computation paths that produce that value. These matches between
initial endpoints are tentative assignments that will be checked when the
final endpoints of the arcs are encountered.

Step 4. Calculate the longest common arc-preserving subsequence of $(S_1, P_1)$
and $(S_2, P_2)$ by traversing the tables. A table is considered active if one arc
from each of the chains in its subset is active.
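
Steps 1 and 2 can be sketched as follows. This is an assumption-laden illustration rather than the paper's procedure: `partition_into_chains` uses a first-fit greedy pass over arcs sorted by initial endpoint (one possible way to obtain chains of arcs that neither cross nor nest), and `subset_index` computes the index $h = \sum_{i \in \text{subset}} 2^i$ of a subset of chain numbers. Both names are hypothetical.

```python
def partition_into_chains(arcs):
    """Greedily split arcs (a, b), a < b, into chains of pairwise disjoint arcs.

    Within a chain every arc starts after the previous one has ended,
    so no two arcs of a chain cross or nest.
    """
    chains = []
    for a, b in sorted(arcs):
        for chain in chains:
            if chain[-1][1] < a:        # last arc of this chain ends before a
                chain.append((a, b))
                break
        else:
            chains.append([(a, b)])     # no chain can take the arc: open a new one
    return chains

def subset_index(chain_numbers):
    """Index h of a subset of chains, with chains numbered from 0."""
    return sum(2 ** i for i in chain_numbers)
```

With chains numbered 0 through $k - 1$, the subset {0, 2} receives index 5 and the empty subset receives index 0, so the $2^k$ subsets map onto the indices 0 through $2^k - 1$.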

At each step, the values for that position in all active tables are calculated. 
The trees are also merged and manipulated. This tree data structure needs to 
support the following operations. 

merge: is applied to a finite list of trees and their corresponding subsequence 
length values, and returns the merge of those trees that have the maximum 
corresponding values. When the trees are merged, they are copied and also 
simplified from the root down by uniting identical children of the same parent 
node. 

test: looks for a given arc assignment pair in the tree; returns true if it is found, 
false otherwise. 

prune: given a tree and an arc assignment pair, removes all paths that do not 
contain the pair, and then removes the pair itself. 

trim: given a tree, an arc number k' , and a flag value, removes all nodes in the 
tree that contain an arc assignment that involves an arc with that number 
k' . This operation checks either the i or j values, depending on the value of 
the flag. 

extend: given a tree and an arc assignment pair, add the pair to the tree as its 
new root. 

The complexity of each of these operations, except for extend, is proportional
to the size of the tree, so it is in $O(|M^{h_1,h_2}[i,j]|)$. The extend operation runs in
constant time.
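
The intended behaviour of the five operations can be illustrated with a deliberately simplified model. The sketch below does not implement the tree: it represents a tree by the set of its root-to-leaf paths, each path being a tuple of arc assignment pairs with the most recent assignment first, so its running times differ from those stated above. All function names are hypothetical, and `has_pair` plays the role of the test operation.

```python
EMPTY = {()}   # a tree with a single empty path and no assignments

def extend(paths, pair):
    """Add the pair as the new root of every path."""
    return {(pair,) + p for p in paths}

def has_pair(paths, pair):
    """The 'test' operation: is the assignment pair on some path?"""
    return any(pair in p for p in paths)

def prune(paths, pair):
    """Keep only paths containing the pair, then remove the pair itself."""
    return {tuple(x for x in p if x != pair) for p in paths if pair in p}

def trim(paths, arc, use_first):
    """Drop every assignment whose i value (use_first) or j value equals arc."""
    idx = 0 if use_first else 1
    return {tuple(x for x in p if x[idx] != arc) for p in paths}

def merge(candidates):
    """candidates: list of (value, paths); unite the paths of maximum value."""
    best = max(value for value, _ in candidates)
    merged = set()
    for value, paths in candidates:
        if value == best:
            merged |= paths
    return best, merged
```

Storing the paths in a shared tree, as the text describes, avoids repeating common prefixes; the set-of-paths view is only meant to make the effect of each operation easy to check by hand.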

These operations are used to keep the trees up to date as the table values
are being calculated. The basic longest common subsequence formula

$T[i,j] = \max(T[i-1,j],\ T[i,j-1],\ T[i-1,j-1] + w(S_1[i], S_2[j]))$

where

$w(x,y) = 1$ if $x = y$, and $0$ otherwise,

is used to calculate the table values, although it can be changed to any LCS-based
alignment weighting scheme. This calculation is varied when arc endpoints
are encountered, as follows:

1. When an initial arc endpoint is encountered, all tables that include that 
arc’s chain in its subset are activated and initialized by copying over needed 
values into the preceding row or column. 

2. When a final arc endpoint is encountered, the table without the initial end- 
point is calculated normally. The other table, where the initial endpoint was 
allowed to match, is calculated without matching the final endpoint. These 
two tables are then merged to find the maximum. The trees are trimmed to 
remove all assignments that use that arc. 
3. If a pair of initial endpoints is encountered, one from each sequence, the
algorithm attempts to match their arcs. The tables are activated and initialized
as in 1, but one table, the one that has both initial endpoints, requires the
tree at that position to be extended by adding that arc assignment pair.

4. If a pair of final endpoints is encountered, the algorithm must determine (us- 
ing test) if the corresponding pair of initial endpoints are in the tree. If they 
are, and the maximum value is produced by matching the final endpoints, 
those endpoints are matched and the tree is pruned. Otherwise, the trees 
and tables are merged as in 2. 

After the table computation, the decision algorithm returns true if and only if the
length of the longest common arc-preserving subsequence, stored at table position $[n, m]$,
is at least $l$, and returns false otherwise. This algorithm computes table entries
for up to $4^k$ tables, each table having $nm$ entries.
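
For reference, the basic recurrence that fills a single table, ignoring arcs entirely, can be sketched as below. This covers only the plain LCS part with the 0/1 weight function $w$; the arc handling of cases 1 to 4 and the trees are not included, and the function name `lcs_length` is an assumption of this example.

```python
def lcs_length(s1, s2):
    """Length of the longest common subsequence of s1 and s2 (no arcs)."""
    n, m = len(s1), len(s2)
    # T[i][j] holds the LCS length of s1[:i] and s2[:j]; row/column 0 are empty prefixes.
    T = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            w = 1 if s1[i - 1] == s2[j - 1] else 0
            T[i][j] = max(T[i - 1][j], T[i][j - 1], T[i - 1][j - 1] + w)
    return T[n][m]
```

Each entry is computed in constant time here, giving $O(nm)$ for one table; the arc-preserving algorithm pays an additional cost per entry for maintaining the trees.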



4.3 Time Complexity Analysis: 

The computation of each table entry takes $O(|M^{h_1,h_2}[i,j]|)$ time.

In this algorithm, the trees are kept minimal so that they only store matches of 
starting endpoints of currently active arcs. 

To find the size of $M^{h_1,h_2}[i,j]$, note that each path in the tree is a sequence both on matched $i$ values ($i'$) and on
matched $j$ values ($j'$). Let $p_1(i')$ be the position of $i'$ among the starting endpoints
of active arcs on $S_1$, so $p_1(i') = |\{(i_1,i_2) : (i_1,i_2) \in P_1,\ i_1 < i < i_2, \text{ and } i_1 \le i'\}|$. Similarly, let $p_2(j')$ be the position of $j'$ among the starting endpoints
of active arcs on $S_2$.

At each position in the tree, replace the label $(i',j')$ with $(p_1(i'),p_2(j'))$. The
sequences of $p_1(i')$ values and $p_2(j')$ values along any path from root to leaf are
strictly decreasing. The maximum number of nodes in such a tree with $(x, y)$ as
its root is thus given by the recurrence relation

$S(1,y) = 1 \ \forall y$ and $S(x,1) = 1 \ \forall x$

$S(x,y) = 1 + \sum_{t=1}^{x-1} \sum_{r=1}^{y-1} S(t,r) \quad \forall y > 1, \forall x > 1$

which converts to

$S(x, y) = S(x - 1,y) + S(x,y - 1)$

and is equivalent to the closed form

$S(x,y) = \binom{x+y-2}{x-1}.$

Since the root of the entire tree $M^{h_1,h_2}[i,j]$ may be blank with all possible
children, the number of nodes in $M^{h_1,h_2}[i,j]$ is at most $\binom{r+s}{s}$, where $r$ is the
number of active arcs from $(S_1, P_1)$ and $s$ is the number of active arcs from
$(S_2, P_2)$.
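
The recurrence for the maximum tree size can be checked numerically. The sketch below evaluates the double-sum recurrence directly and compares it with the two-term form and with the binomial value $\binom{x+y-2}{x-1}$, which is what the two-term form with boundary values 1 produces; the function name `tree_size` is hypothetical.

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def tree_size(x, y):
    """Maximum number of nodes in a tree whose root is labelled (x, y)."""
    if x == 1 or y == 1:
        return 1
    # The root plus every possible subtree rooted at a strictly smaller label.
    return 1 + sum(tree_size(t, r) for t in range(1, x) for r in range(1, y))

# Compare the recurrence with the two-term form and the binomial value.
for x in range(2, 8):
    for y in range(2, 8):
        assert tree_size(x, y) == tree_size(x - 1, y) + tree_size(x, y - 1)
        assert tree_size(x, y) == comb(x + y - 2, x - 1)
```

A blank root with all possible children over $r$ and $s$ active arcs corresponds to `tree_size(r + 1, s + 1)`.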

To determine the total size of the trees over all the tables, consider that
both sequences have bounded cutwidth $k$, and that the algorithm is currently
computing any specific position $[i,j]$ in each of the tables. Each time the tables
are split, half of the tables are allowed to include the new assignment, while the
other half cannot include it. So for each possible $r$ and $s$, the number of tables
that can have $r$ active arcs from $S_1$ and $s$ active arcs from $S_2$ is $\binom{k}{r} \cdot \binom{k}{s}$. Thus
the total number of tree nodes at position $[i,j]$ over all the $4^k$ tables is at most

$S(k) = \sum_{r=0}^{k} \sum_{s=0}^{k} \binom{k}{r} \binom{k}{s} \binom{r+s}{s}.$

This expression can be convoluted, using $\binom{r+s}{s} \le 2^{r+s}$ and collecting terms with $t = r + s$, to get

$S(k) \le \sum_{t=0}^{2k} \binom{2k}{t} 2^t = 9^k$

so the sum over all the tables of the number of tree entries that must be copied
is no more than $9^k$. An asymptotic estimate for a lower bound on $S(k)$ can be
found by looking at one specific term of the sum, where t = ^. From this term. 



$S(k) \ge$ [expression lost in extraction]

This term can be expanded using factorials and estimated using Stirling's approximation ($n! \approx \sqrt{2\pi n}\,(n/e)^n$).

This approximation reveals that the upper bound of $9^k$ is very close to $S(k)$; it
can be off by at most a factor of $k$.

Since $S(k)$ is the upper bound on the total size of the trees for each position
(r, s) over all tables, and there are $nm$ such positions to be computed, the
algorithm runs in time $O(9^k nm)$. □
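
The $9^k$ bound can be sanity-checked for small $k$. The sketch below assumes that a table with $r$ and $s` active arcs holds at most $\binom{r+s}{s}$ tree nodes and that $\binom{k}{r}\binom{k}{s}$ tables have those counts; under these assumptions it sums the tree sizes over all tables and compares the total with $9^k$. The function name `total_tree_nodes` is hypothetical.

```python
from math import comb

def total_tree_nodes(k):
    """Sum of the per-table tree-size bounds over all 4**k tables."""
    return sum(
        comb(k, r) * comb(k, s) * comb(r + s, s)
        for r in range(k + 1)
        for s in range(k + 1)
    )

# The bound comb(r + s, s) <= 2**(r + s) gives (sum_r comb(k, r) * 2**r)**2 == 9**k.
for k in range(8):
    assert total_tree_nodes(k) <= 9 ** k
```

For $k = 1$ the total is 5 against a bound of 9, and for $k = 2$ it is 33 against 81.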



5 Conclusion 

The examination of this problem using classical complexity shows that it is
NP-complete. A parameterized investigation, however, reveals that it is
W[1]-complete for desired subsequence length $l$, but fixed-parameter tractable for
bounded cutwidth $k$. Both hardness results come from a single dual-purpose
reduction. The algorithm presented to show fixed-parameter tractability for $k$ runs
in time $O(9^k nm)$. This time complexity means that if the complexity of the
arc structure is bounded by a logarithm of the maximum sequence length n, the 
longest arc-preserving common subsequence can be found in time G 0(ji^m). 
This time complexity is an improvement over earlier results, and shows con- 
ditions under which the problem becomes tractable. The algorithm given also 
handles pseudoknots on both sequences, while previous work does not. 

The parameterized analysis indicates that the problem is tractable for se- 
quences with arc structure of bounded cutwidth. Different kinds of structures 
can be looked at to determine if they meet this restriction, or if they can be 
manipulated to meet it. Many RNA structures contain highly repetitive arcs. 
This repetition could be exploited to compress the arc structure into something 
that has bounded cutwidth. This algorithm can also be extended to work with 
weights for both match and mismatch for symbols and arcs. The algorithm de- 
tects arc mismatch, so it can apply weight penalties for this, and can also use 
negative weights for arcs. Using negative weights can allow for a small reduction 
in the score if both symbols of a linked pair are matched; this can alleviate the 
sometimes overpowering effect of the weights of matched pairs, allowing other 
matches to have more relative effect on the sequence similarity score. The use of 
negative arc weights is only possible with an algorithm that preserves induced 
arcs, such as the one given in this paper. 

Abstract. The primary structure of a ribonucleic acid (RNA) molecule
is a sequence of nucleotides (bases) over the alphabet {A, C, G, U}. The
secondary or tertiary structure of an RNA is a set of base-pairs
(nucleotide pairs) which form bonds between A-U and C-G. For
secondary structures, these bonds have been traditionally assumed to be
one-to-one and non-crossing.

This paper considers a notion of similarity between two RNA molecule
structures taking into account the primary, the secondary and the
tertiary structures. We show that in general this problem is NP-hard for
tertiary structures. We present algorithms for the case where at least one
of the RNAs involved has only secondary structure. We then show that this
algorithm might be used in practical applications. We also
show an approximation algorithm.



1 Introduction 

Ribonucleic Acid (RNA) is an important molecule which performs a wide range
of functions in biological systems. In particular it is RNA (not DNA) that contains
the genetic information of viruses such as HIV and therefore regulates the functions
of such viruses. RNA has recently become the center of much attention because of
its catalytic properties, leading to an increased interest in obtaining structural
information.

It is well known that secondary and tertiary structural features of RNAs are
important in the molecular mechanism involving their functions. The presumption,
of course, is that to a preserved function there corresponds a preserved
molecular conformation and, therefore, a preserved secondary and tertiary
structure. Therefore the ability to compare RNA structures is useful.

In RNA secondary or tertiary structure, a bonded pair of bases (base-pair) is 
usually represented as an edge between the two complementary bases involved 

* Research supported partially by the Natural Sciences and Engineering Research 
Council of Canada under Grant No. OGP0046373. 
in the bond. It is assumed that any base participates in at most one such pair.
For the secondary structure, the edges of the bonded pairs are non-crossing. 

Following the notion of similarity in comparing sequences, we define a sim- 
ilarity between two RNA molecule structures taking into account the primary, 
the secondary and the tertiary structures. 

Results 

We show that computing this similarity between RNA tertiary structures is
NP-complete. We present an algorithm for the case where at least one of the RNAs
involved is of secondary structure. We then show that this algorithm could be used
to compare tertiary structures in practical applications. Finally, we give an
approximation algorithm.

Related work 

Since the secondary structure appears as tree-like structure, there are works 
considering comparison using tree comparison jYitl'rlMllj . However these methods 
do not directly use base-paired nucleotides and unpaired nucleotides. Instead 
loops and stems (stacked pairs) are used as the basic unit making it difficult to 
define the semantic meaning in the process of converting one RNA into another. 
To overcome this difficulty, the method we propose in this paper directly uses
base-paired and unpaired nucleotides in the representation and applies some basic
operations on them.

Another line of works are primary structure based where the comparison is 
basically done on the primary structure while trying to incorporate secondary 
structure data m The weakness of this approach is that it does not treat a 
base-pair as a whole entity. For example, in the comparison of two RNAs, a
base-pair from one RNA can have one nucleotide deleted while the other
nucleotide is matched to a nucleotide (unpaired or even paired) in the other RNA. Our
method treats a base-pair as a unit: it can be matched to another base-pair, it
can be deleted, or it can be inserted. This is closer to the spirit of the
comparative analysis method currently being used in the analysis of RNA secondary
structures, either manually or automatically.



2 Comparing Two RNA Structures 

2.1 RNA Structures and Basic Operations 

The primary structure of a ribonucleic acid (RNA) molecule is a sequence of
nucleotides (bases) over the four-letter alphabet $\{A, C, G, U\}$. The secondary

or tertiary structure of an RNA is a set of base-pairs (nucleotide pairs) which 
formed bonds between A—U and C — G. Following Zuker |14llhlib| , we assume 
a model where there are no knots in the secondary structure. This means that
for the secondary structure, the bonds are non-crossing. For tertiary structure,
there is no restriction of non-crossing.

Given an RNA structure $R$, we use $R[i]$ to represent the $i$th nucleotide of $R$.
We use $R[i..j]$ to represent the sequence of nucleotides from $R[i]$ to $R[j]$.
We use $S(R)$ to represent the set of structural elements consisting of both
its set of base-pairs and the remaining unpaired nucleotides.

$S(R) = \{(i,j) \mid i < j \text{ and } (R[i], R[j]) \text{ is a base pair in } R\} \cup \{(i,i) \mid R[i] \text{ is not involved in any base pair in } R\}$

We use $S(R)[i..j]$ to represent the set of structural elements in sequence $R[i..j]$.

$S(R)[i..j] = \{r \mid r = (k,l) \in S(R),\ i \le k,\ l \le j\}$

For $r = (i,j) \in S(R)$, we use $label_R(r)$ to represent the label of $r$ in $R$. If $i = j$, then
$label_R(r) = R[i] = R[j]$; otherwise $label_R(r) = R[i]R[j]$. For $r = (i,j) \in S(R)$,
$i$ and $j$ are often called the 5' end and 3' end of $r$. We define $left(r) = i$ and
$right(r) = j$.
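
These definitions translate directly into code. The following is a minimal sketch, assuming 1-based positions, base pairs given as a set of position pairs, and hypothetical function names `structural_elements`, `label`, and `restrict`.

```python
def structural_elements(seq, base_pairs):
    """S(R): base pairs (i, j) with i < j, plus (i, i) for each unpaired base."""
    paired = {p for pair in base_pairs for p in pair}
    elems = {(min(i, j), max(i, j)) for i, j in base_pairs}
    elems |= {(i, i) for i in range(1, len(seq) + 1) if i not in paired}
    return elems

def label(seq, r):
    """label_R(r): one base for an unpaired element, two bases for a base pair."""
    i, j = r
    return seq[i - 1] if i == j else seq[i - 1] + seq[j - 1]

def restrict(elems, i, j):
    """S(R)[i..j]: the elements lying entirely within positions i..j."""
    return {(k, l) for (k, l) in elems if i <= k and l <= j}
```

For the sequence ACGU with the single base pair (1, 4), the structural elements are (1, 4), (2, 2), and (3, 3), with labels AU, C, and G.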

Following the tradition in sequence comparison mm. we define three oper- 
ations, relabel, delete, and insert, on RNA structures. For a given RNA structure 
R, each operation can be applied to either a base-pair in S{R) or an unpaired 
base. Relabelling a base-pair is to replace one base-pair in S{R) with another. 
This means that at the sequence level, two bases may be changed at the same 
time. Deleting a base-pair is to delete the pair from S{R). At the sequence level, 
this means to delete two bases at the same time. Inserting a base-pair is to insert 
a new base-pair into S{R). At the sequence level, this means to insert two bases 
at the same time. Relabelling an unpaired base is to replace it with another base. 
Deleting an unpaired base is to delete the base from the sequence. Inserting a 
base is to insert a new base into the sequence as an unpaired base. Note that 
there is no relabel operation that can change a base-pair to an unpaired base or 
vice versa. 

Following [11, 13], we represent an edit operation as $a \to b$, where $a$ and $b$
are either $\Lambda$ or labels of base-pairs from $\{A, C, G, U\} \times \{A, C, G, U\}$, or unpaired
bases from $\{A, C, G, U\}$.

We call $a \to b$ a change operation if $a \ne \Lambda$ and $b \ne \Lambda$; a delete operation if
$b = \Lambda$; and an insert operation if $a = \Lambda$.

Let $S$ be a sequence $s_1, \ldots, s_k$ of edit operations. An $S$-derivation from RNA
structure $A$ to RNA structure $B$ is a sequence of RNA structures $A_0, \ldots, A_k$ such
that $A = A_0$, $B = A_k$, and $A_{i-1} \to A_i$ via $s_i$ for $1 \le i \le k$.

Let $\gamma$ be a cost function which assigns to each edit operation $a \to b$ a
nonnegative real number $\gamma(a \to b)$. We constrain $\gamma$ to be a distance metric.
That is, i) $\gamma(a \to b) \ge 0$, $\gamma(a \to a) = 0$; ii) $\gamma(a \to b) = \gamma(b \to a)$; and iii)
$\gamma(a \to c) \le \gamma(a \to b) + \gamma(b \to c)$.

We extend $\gamma$ to a sequence of edit operations $S$ by letting $\gamma(S) = \sum_{i=1}^{k} \gamma(s_i)$.

The edit distance between two RNA structures is defined by considering the
minimum cost edit operation sequence that transforms one structure to the other.
Formally the edit distance between $R_1$ and $R_2$ is defined as:

$D(R_1,R_2) = \min \{\gamma(T) \mid T \text{ is an edit operation sequence taking } S(R_1) \text{ to } S(R_2)\}$

2.2 Mapping between RNA Structures

Let $r = (r_l, r_r)$ and $s = (s_l, s_r)$ be two elements in $S(R)$ of an RNA $R$. We define
the relation between $r$ and $s$ as follows. We say $r$ is before $s$ if $r_r < s_l$. We say $r$
is inside $s$ if $s_l < r_l$ and $r_r < s_r$. We say $r$ is cross-before $s$ if $r_l < s_l < r_r < s_r$.

Let $R_1$ and $R_2$ be two RNA structures. We define a triple $(M, R_1, R_2)$ to be
a mapping from $R_1$ to $R_2$, where $M$ is a binary relation on $S(R_1) \times S(R_2)$ such
that

(1) For any $(r, s)$ in $M$,

$r$ is a base-pair in $R_1$ if and only if $s$ is a base-pair in $R_2$.

(2) For any pair of $(r_1,s_1)$ and $(r_2,s_2)$ in $M$,

(a) $r_1 = r_2$ if and only if $s_1 = s_2$ (one-to-one).

(b) $r_1$ is before $r_2$ if and only if $s_1$ is before $s_2$.

(c) $r_1$ is inside $r_2$ if and only if $s_1$ is inside $s_2$.

(d) $r_1$ is cross-before $r_2$ if and only if $s_1$ is cross-before $s_2$.

We will use $M$ instead of $(M, R_1, R_2)$ if there is no confusion. Let $M$ be a
mapping from $R_1$ to $R_2$. Then we can similarly define the cost of $M$:

$\gamma(M) = \sum_{(r,s) \in M} \gamma(label_{R_1}(r) \to label_{R_2}(s)) + \sum_{r \notin M} \gamma(label_{R_1}(r) \to \Lambda) + \sum_{s \notin M} \gamma(\Lambda \to label_{R_2}(s))$

Mappings can be composed. Let $M_1$ be a mapping from $R_1$ to $R_2$ and $M_2$
be a mapping from $R_2$ to $R_3$. Define

$M_1 \circ M_2 = \{(r, t) \mid \exists s \text{ s.t. } (r, s) \in M_1 \text{ and } (s, t) \in M_2\}.$
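
The mapping conditions can be checked with a short routine. The sketch below is illustrative only: it assumes structural elements are (left, right) position pairs, takes cross-before to mean $r_l < s_l < r_r < s_r$, and uses the hypothetical names `relation` and `is_mapping`. Because the three relations are mutually exclusive, conditions (2b) to (2d) reduce to requiring that both sides of the mapping stand in the same relation.

```python
def relation(r, s):
    """Return 'before', 'inside', 'cross-before', or None for elements r and s."""
    (rl, rr), (sl, sr) = r, s
    if rr < sl:
        return "before"
    if sl < rl and rr < sr:
        return "inside"
    if rl < sl < rr < sr:
        return "cross-before"
    return None

def is_mapping(m):
    """m: set of (r, s) pairs with r from S(R1) and s from S(R2)."""
    for r, s in m:
        if (r[0] != r[1]) != (s[0] != s[1]):            # condition (1)
            return False
    for r1, s1 in m:
        for r2, s2 in m:
            if (r1 == r2) != (s1 == s2):                # condition (2a)
                return False
            if relation(r1, r2) != relation(s1, s2):    # conditions (2b)-(2d)
                return False
    return True
```

A base pair enclosing an unpaired base may be mapped to another base pair enclosing an unpaired base, whereas two crossing base pairs cannot be mapped to two nested ones.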



Lemma 1. 1) $M_1 \circ M_2$ is a mapping between $R_1$ and $R_3$. 2) $\gamma(M_1 \circ M_2) \le \gamma(M_1) + \gamma(M_2)$.

Proof. 1) follows from the definition of mapping. Let us check condition (2) only.
Suppose that $(r_1, t_1)$ and $(r_2, t_2)$ are in $M_1 \circ M_2$. By the definition of composition, there
exist $s_1$ and $s_2$ such that $(r_1,s_1)$ and $(r_2,s_2)$ are in $M_1$ and $(s_1,t_1)$ and $(s_2,t_2)$
are in $M_2$. If $r_1$ is before $r_2$, then by the definition of mapping, $s_1$ is before $s_2$.
Therefore $t_1$ is before $t_2$, again by the definition of mapping. Similarly, if $r_1$ is
inside $r_2$ or $r_1$ is cross-before $r_2$, then $t_1$ is inside $t_2$ or $t_1$ is cross-before $t_2$, respectively.
2) Let $M_1$ be the mapping from $R_1$ to $R_2$, $M_2$ be the mapping from $R_2$ to $R_3$,
and $M_1 \circ M_2$ be the composed mapping from $R_1$ to $R_3$. Three general situations
occur: $(r, s) \in M_1 \circ M_2$, $r \notin M_1$, or $s \notin M_2$. In each case this corresponds
to an edit operation $\gamma(x \to y)$ where $x$ and $y$ may be labels or may be $\Lambda$.
In all such cases, the triangle inequality on the distance metric $\gamma$ ensures that
$\gamma(x \to y) \le \gamma(x \to z) + \gamma(z \to y)$. □

The relation between a mapping and a sequence of edit operations is as 
follows: 
Fig. 1. RNA structure 1



Lemma 2. Given $S = s_1, \ldots, s_k$, a sequence of edit operations from $R_1$ to $R_2$,
there exists a mapping $M$ from $R_1$ to $R_2$ such that $\gamma(M) \le \gamma(S)$. Conversely,
for any mapping $M$, there exists a sequence of edit operations $S$ such that $\gamma(S) = \gamma(M)$.

Proof. The first part can be proved by induction on $k$. The base case is $k = 1$.
This case holds because any single edit operation preserves the mapping
conditions. In the general case, let $S_1$ be the sequence $s_1, \ldots, s_{k-1}$ of edit operations.
There exists a mapping $M_1$ such that $\gamma(M_1) \le \gamma(S_1)$. Let $M_2$ be the mapping
for $s_k$. From Lemma 1 we have $\gamma(M_1 \circ M_2) \le \gamma(M_1) + \gamma(M_2) \le \gamma(S)$. □

Based on the lemma, the following theorem states the relation between the
distance and the mappings.

Theorem 1. $D(R_1,R_2) = \min\{\gamma(M) \mid M \text{ is a mapping from } R_1 \text{ to } R_2\}$

Proof. Immediate from Lemma 2. □



3 NP-Hard Result 

We now consider the problem of comparing RNA structures where both struc- 
tures are tertiary structures. We show that this is in general NP-hard. 

We will reduce the 3-SAT problem to this problem. 

Problem of 3-SAT 

Let $S = C_1 \cdot C_2 \cdots C_n$, where $C_i = (v_{i_1} \cup v_{i_2} \cup v_{i_3})$, be an instance of the 3-SAT
problem. We will construct two RNA structures $R_1$ and $R_2$ as in Figure 1 and
Figure 2.

In $R_1$, there are $n$ segments each of which is enclosed by four base pairs.
These base pairs are all AU pairs. And each segment is connected to every other

Fig. 2. RNA structure 2



segment by four base pairs of CG type. Note that the number of base pairs in
$R_1$ is $4 \cdot n \cdot (1 + (n-1)/2)$.

In $R_2$, for each $v_{i_k}$ there is a corresponding segment which is enclosed by
two base pairs of AU type. Each clause $C_i$ is then represented by segments of
$v_{i_1}$, $v_{i_2}$, $v_{i_3}$ and is enclosed by another two base pairs of AU type.

We now consider base pairs between segments in $R_2$. For each $v_{i_k}$, define $S_{i_k}$
as follows.

$S_{i_k} = \{v_{j_l} \mid i \ne j \text{ and } v_{j_l} \text{ is not the complement of } v_{i_k}\}$

For each $v_{j_l}$ in $S_{i_k}$ there are four bases in the segment for $v_{i_k}$. If $j < i$ then
these bases are G's, otherwise these are C's. Suppose that $v_{j_l}$ and $v_{g_h}$ are in $S_{i_k}$;
then the bases for $v_{j_l}$ are before the bases for $v_{g_h}$ if either $j > g$ or $j = g$ and
$l > h$. Note that if $v_{j_l}$ is in $S_{i_k}$ then $v_{i_k}$ is also in $S_{j_l}$. Now suppose that $i < j$;
then the bases in segment $v_{i_k}$ for $v_{j_l}$ are C's and the bases in segment $v_{j_l}$ for
$v_{i_k}$ are G's. In the RNA structure $R_2$, they form base pairs. Figure 3 shows an
example involving two clauses. Let $N$ be the number of base pairs in $R_2$; then

$N = 8 \cdot n + 2 \cdot \sum_{i=1}^{n} \sum_{k=1}^{3} |S_{i_k}|.$

It is clear that $R_1$ and $R_2$ can be constructed in polynomial time from an
instance of 3-SAT problem $S$. In the following, we assume that each operation
has unit cost. We will show that $S$ can be satisfied if and only if $D(R_1, R_2) = N - 4 \cdot n \cdot (1 + (n-1)/2)$.

Let $S$ be an instance of 3-SAT problem and $R_1$ and $R_2$ be as in Figure 1 and
Figure 2. The following lemmas give the relationship between $S$ and $D(R_1, R_2)$.
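
The counts used in the reduction can be computed for a concrete formula. The sketch below assumes clauses are given as 3-tuples of nonzero integers, with -x denoting the complement of x, and uses the hypothetical name `base_pair_counts`. It returns the number of base pairs in $R_1$, the number $N$ of base pairs in $R_2$, and their difference, which is the target distance under unit costs.

```python
def base_pair_counts(clauses):
    """Return (pairs in R1, N = pairs in R2, target distance N - pairs in R1).

    clauses: list of 3-tuples of nonzero ints; -x is the complement of x.
    """
    n = len(clauses)
    # 4 AU pairs per segment plus 4 CG pairs per pair of segments: 4n(1 + (n-1)/2).
    pairs_r1 = 4 * n + 4 * (n * (n - 1) // 2)
    # Sum of |S_{i_k}|: literals in other clauses that are not the complement.
    sum_s = 0
    for i, clause in enumerate(clauses):
        for lit in clause:
            for j, other_clause in enumerate(clauses):
                if j != i:
                    sum_s += sum(1 for other in other_clause if other != -lit)
    n_r2 = 8 * n + 2 * sum_s
    return pairs_r1, n_r2, n_r2 - pairs_r1
```

For the two clauses (1, 2, 3) and (-1, 2, 3), $R_1$ has 12 base pairs, $N = 48$, and the target distance is 36.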

Lemma 3. If $S$ can be satisfied, then $D(R_1, R_2) = N - 4 \cdot n \cdot (1 + (n-1)/2)$. 

Proof. If $S$ can be satisfied, then for each clause $C_i$ there is at least one $v_{i_k}$ whose 
value is true. Consider $R_2$. For each clause, we can first delete any segment which 
does not correspond to $v_{i_k}$, together with its enclosing base pairs. For the segment of $v_{i_k}$, 
we can delete the bases that are paired with the segments already deleted. 
The resulting structure after these deletions is exactly the same as 
$R_1$. Therefore $D(R_1, R_2) = N - 4 \cdot n \cdot (1 + (n-1)/2)$, since the number of base 
pairs in $R_2$ is $N$ and in $R_1$ is $4 \cdot n \cdot (1 + (n-1)/2)$. □ 



Lemma 4. If $D(R_1, R_2) = N - 4 \cdot n \cdot (1 + (n-1)/2)$, then $S$ can be satisfied. 

Fig. 3. An example involving two clauses, S = (X+Y+Z)(X+Y+Z) 

Proof. In this case, every base pair in $R_1$ is in the optimal mapping $M$. In 
addition, each base pair in $R_1$ is matched to an identical base pair in $R_2$. This 
means that the four base pairs enclosing each segment must map to four base 
pairs for each clause in $R_2$. The only way for this to happen is that for 
each clause two (out of three) segments are deleted. Therefore mapping 
$M$ keeps one variable of each clause in $R_2$, and this variable is connected to 
all the variables left in other clauses by means of base pairing. So for any two 
variables left there is no conflict. Hence we can assign the value true to all the 
variables left, and $S$ is satisfied. □ 



Theorem 2. The problem of determining if $D(R_1, R_2) \le k$ is NP-complete. 

Proof. This problem is clearly in NP, since one can guess a mapping between $R_1$ and 
$R_2$ and check whether its cost is less than or equal to $k$. 

By Lemma 3 and Lemma 4, $S$ is satisfied if and only if 
$D(R_1, R_2) = N - 4 \cdot n \cdot (1 + (n-1)/2)$. Therefore this problem is NP-hard. 

Hence this problem is NP-complete. □ 

4 Algorithms 

When both RNAs are secondary structures, since there is no crossing, we can 
represent RNA structures as ordered forests and then use the tree edit distance 
algorithm to solve this problem [12,13]. 

We now consider the case where at most one of the RNAs involved is a tertiary 
structure. We present an algorithm which solves this problem. An extension of 
our algorithm can handle the case where both RNAs are tertiary structures with 
H-type pseudo-knots. 



4.1 Properties 

We use a bottom-up approach. We consider smaller substructures first and 
eventually consider the whole structure. We can now consider how to compute 
$D(R_1[l_1..r_1], R_2[l_2..r_2])$. 

Let $S_1[1..m]$ be an array containing pairs in $S(R_1)[l_1..r_1]$ sorted by 3' end. 

Let $S_2[1..n]$ be an array containing pairs in $S(R_2)[l_2..r_2]$ sorted by 3' end. 

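Let $S_1[i] = (s_1, t_1)$ and $S_2[j] = (s_2, t_2)$. We define $\mathit{left}_1[i]$, $\mathit{cross\_left}_1[i]$ 
and $\mathit{cross\_weight}_1[i]$ as follows; $\mathit{left}_2[j]$, $\mathit{cross\_left}_2[j]$ and $\mathit{cross\_weight}_2[j]$ are 
defined similarly. 

$$\mathit{left}_1[i] = \begin{cases} \max\{\, k \mid \text{the 3' end of } S_1[k] \text{ is less than } s_1 \,\} & \text{if such a } k \text{ exists} \\ 0 & \text{if no such } k \text{ exists} \end{cases}$$

$$\mathit{cross\_left}_1[i] = \begin{cases} 1 & \text{if there exists a } k < i \text{ such that } S_1[k] \ \mathit{cross\_before} \ S_1[i] \\ 0 & \text{if no such } k \text{ exists} \end{cases}$$

$$\mathit{cross\_weight}_1[i] = \sum_{1 \le k < i,\ S_1[k] \ \mathit{cross\_before} \ S_1[i]} \gamma(\mathit{label}_{R_1}(S_1[k]) \to \lambda)$$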
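The three arrays can be computed directly from their definitions. The sketch below is illustrative only. It assumes that a pair `(s, t)` is `cross_before` a pair `(s1, t1)` when `s < s1 < t < t1` (the relation is defined earlier in the paper, so this reading is an assumption here), that a single base is stored as `(s, s)`, and that `delete_cost` is a caller-supplied stand-in for $\gamma(\mathit{label}(\cdot) \to \lambda)$.

```python
from typing import Callable, List, Tuple

Pair = Tuple[int, int]  # (5' end, 3' end); a single base is (s, s)


def cross_before(a: Pair, b: Pair) -> bool:
    """Assumed definition: a starts before b and ends strictly inside b."""
    return a[0] < b[0] < a[1] < b[1]


def side_arrays(pairs: List[Pair], delete_cost: Callable[[Pair], float]):
    """Compute left, cross_left and cross_weight for one structure.

    `pairs` must be sorted by 3' end. The returned lists are 1-indexed to
    match the text (slot 0 is unused and stays 0).
    """
    m = len(pairs)
    left = [0] * (m + 1)
    cross_left = [0] * (m + 1)
    cross_weight = [0.0] * (m + 1)
    for i in range(1, m + 1):
        s_i = pairs[i - 1][0]
        for k in range(1, i):
            if pairs[k - 1][1] < s_i:
                left[i] = k  # k grows, so the last hit is the largest k
            if cross_before(pairs[k - 1], pairs[i - 1]):
                cross_left[i] = 1
                cross_weight[i] += delete_cost(pairs[k - 1])
    return left, cross_left, cross_weight


if __name__ == "__main__":
    example = [(1, 5), (3, 8), (9, 10)]
    print(side_arrays(example, lambda pair: 1.0))
```

This direct version takes quadratic time in the number of pairs; it is meant to mirror the definitions, not to be efficient.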

Again let $S_1[i] = (s_1, t_1)$ and $S_2[j] = (s_2, t_2)$. We now define $D_1(i,j)$ and 
$D_2(i,j)$ as follows. 

$$D_1(i,j) = D(R_1[l_1..t_1], R_2[l_2..t_2])$$

$$D_2(i,j) = D(R_1[s_1..t_1], R_2[s_2..t_2])$$



Lemma 5. If $\mathit{left}_1[i] \neq 0$, $\mathit{left}_2[j] \neq 0$, $\mathit{cross\_left}_1[i] \neq 0$, or $\mathit{cross\_left}_2[j] \neq 0$, 
then 

$$D_1(i,j) = \min \begin{cases} D_1(i-1, j) + \gamma(\mathit{label}_{R_1}(S_1[i]) \to \lambda) \\ D_1(i, j-1) + \gamma(\lambda \to \mathit{label}_{R_2}(S_2[j])) \\ D_1(\mathit{left}_1[i], \mathit{left}_2[j]) + D_2(i,j) + \mathit{cross\_weight}_1[i] + \mathit{cross\_weight}_2[j] \end{cases}$$

Proof. Let $S_1[i] = (s_1, t_1)$ and $S_2[j] = (s_2, t_2)$. Consider the best mapping 
between $R_1[l_1..t_1]$ and $R_2[l_2..t_2]$. If $S_1[i] = (s_1, t_1)$ is not in the mapping, then 
$D_1(i,j) = D_1(i-1, j) + \gamma(\mathit{label}_{R_1}(S_1[i]) \to \lambda)$. If $S_2[j] = (s_2, t_2)$ is not in 
the mapping, then $D_1(i,j) = D_1(i, j-1) + \gamma(\lambda \to \mathit{label}_{R_2}(S_2[j]))$. If both 
$S_1[i] = (s_1, t_1)$ and $S_2[j] = (s_2, t_2)$ are in the mapping, then they should map to 
each other by the definition of mapping. In this case, since one of the structures 
is a secondary structure, any base pair that is cross_before $S_1[i]$ or $S_2[j]$ will not be in 
the mapping and should be deleted. Therefore, if $\mathit{left}_1[i] \neq 0$ or $\mathit{left}_2[j] \neq 0$, 
$D_1(i,j) = D_1(\mathit{left}_1[i], \mathit{left}_2[j]) + D_2(i,j) + \mathit{cross\_weight}_1[i] + \mathit{cross\_weight}_2[j]$. 
If $\mathit{left}_1[i] = 0$ and $\mathit{left}_2[j] = 0$, and $\mathit{cross\_left}_1[i] \neq 0$ or $\mathit{cross\_left}_2[j] \neq 0$, then 
$D_1(i,j) = D_2(i,j) + \mathit{cross\_weight}_1[i] + \mathit{cross\_weight}_2[j]$. If we define $D_1(0,0) = 0$, 
then we can combine the above two cases. Note that one of the cross_weights is 
zero, since in a secondary structure there is no crossing. Also, if $S_1[i]$ and $S_2[j]$ are 
both single bases, both cross_weights are zero. □ 

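Lemma 6. If $\mathit{left}_1[i] = 0$, $\mathit{left}_2[j] = 0$, $\mathit{cross\_left}_1[i] = 0$, and $\mathit{cross\_left}_2[j] = 0$, 
then 

$$D_1(i,j) = \min \begin{cases} D_1(i-1, j) + \gamma(\mathit{label}_{R_1}(S_1[i]) \to \lambda) \\ D_1(i, j-1) + \gamma(\lambda \to \mathit{label}_{R_2}(S_2[j])) \\ D_1(i-1, j-1) + \gamma(\mathit{label}_{R_1}(S_1[i]) \to \mathit{label}_{R_2}(S_2[j])) \end{cases}$$

Proof. Let $S_1[i] = (s_1, t_1)$ and $S_2[j] = (s_2, t_2)$. Consider the best mapping 
between $R_1[l_1..t_1]$ and $R_2[l_2..t_2]$. The first two cases are similar to Lemma 5. For 
the last case, since there is no pair before or cross_before $S_1[i]$ or $S_2[j]$, every $S_1[k]$, 
$1 \le k < i$, is inside $S_1[i]$ and every $S_2[k]$, $1 \le k < j$, is inside $S_2[j]$. Therefore $D_1(i,j)$ 
$= D_1(i-1, j-1) + \gamma(\mathit{label}_{R_1}(S_1[i]) \to \mathit{label}_{R_2}(S_2[j])) + \mathit{cross\_weight}_1[i] + 
\mathit{cross\_weight}_2[j]$. Both cross_weight terms are zero here, since $\mathit{cross\_left}_1[i] = 0$ 
and $\mathit{cross\_left}_2[j] = 0$. □ 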

From the above lemmas, we can compute $D(R_1, R_2)$ using a bottom-up 
approach. Moreover, it is clear that we do not need to compute all $D(R_1[l_1..r_1], 
R_2[l_2..r_2])$. Since we only use $D_2(i,j)$ in Lemma 5 and Lemma 6, we only need to compute 
those $D(R_1[l_1..r_1], R_2[l_2..r_2])$ such that $(l_1, r_1)$ is a base pair in $R_1$ and $(l_2, r_2)$ is 
a base pair in $R_2$. Furthermore, by Lemma 6, if $(l_1, r_1)$ and $(l_1+1, r_1-1)$ are both 
base pairs in $R_1$ and $(l_2, r_2)$ and $(l_2+1, r_2-1)$ are both base pairs in $R_2$, then 
we only need to compute $D(R_1[l_1..r_1], R_2[l_2..r_2])$. The values $D(R_1[l_1..r_1], R_2[l_2+1..r_2-1])$, 
$D(R_1[l_1+1..r_1-1], R_2[l_2..r_2])$, and $D(R_1[l_1+1..r_1-1], R_2[l_2+1..r_2-1])$ will 
be by-products of the computation of $D(R_1[l_1..r_1], R_2[l_2..r_2])$. 

These base pairs are called stacked pairs. A stem in an RNA $R$ is a set of 
stacked pairs of maximum size. More formally, we say $s = (i, j, k)$ is a stem in 
$R(S)$ if $(i, j), (i+1, j-1), \ldots, (i+k-1, j-k+1)$ are all base pairs in $R(S)$ and 
$(i-1, j+1)$ and $(i+k, j-k)$ are not base pairs in $R(S)$. 
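The formal definition of a stem translates into a short routine. The following is a minimal sketch under the assumption that the structure is given as a collection of `(i, j)` base pairs with `i < j`; the name `find_stems` is hypothetical and the sorting by 3' end anticipates the stem lists used by the algorithm.

```python
from typing import Iterable, List, Tuple


def find_stems(base_pairs: Iterable[Tuple[int, int]]) -> List[Tuple[int, int, int]]:
    """Return all stems (i, j, k), sorted by the 3' end j of the outermost pair.

    (i, j, k) is a stem if (i, j), (i+1, j-1), ..., (i+k-1, j-k+1) are base
    pairs while (i-1, j+1) and (i+k, j-k) are not.
    """
    pair_set = set(base_pairs)
    stems = []
    for (i, j) in pair_set:
        if (i - 1, j + 1) in pair_set:
            continue  # (i, j) is stacked inside another pair, not outermost
        k = 1
        while (i + k, j - k) in pair_set:
            k += 1  # extend the stack inwards as far as it goes
        stems.append((i, j, k))
    return sorted(stems, key=lambda stem: stem[1])


if __name__ == "__main__":
    pairs = [(1, 10), (2, 9), (3, 8), (12, 20), (13, 19)]
    print(find_stems(pairs))
```

Each base pair belongs to exactly one stem, so an isolated base pair is reported as a stem with `k = 1`.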

4.2 Algorithm 

Given $R_1$ and $R_2$, we can first compute sorted stem lists $L_1$ for $R_1$ and $L_2$ for $R_2$. 
It follows from the above discussion that, for each pair of stems $L_1[i] = (i_1, j_1, k_1)$ 
and $L_2[j] = (i_2, j_2, k_2)$, we have to compute $D(R_1[i_1, j_1], R_2[i_2, j_2])$. Figure 5 
shows the algorithm. We use Lemma 5 and Lemma 6 to compute $D(R_1[i_1, j_1], R_2[i_2, j_2])$. 
Figure 4 shows this computation. 

To compute D(R1[i1, j1], R2[i2, j2]): 

    compute a sorted list S1 of base pairs inside (i1, j1); 
    compute a sorted list S2 of base pairs inside (i2, j2); 
    compute left1[] and left2[]; 
    compute cross_left1[] and cross_left2[]; 
    compute cross_weight1[] and cross_weight2[]; 
    D1(0,0) = 0 
    for i := 1 to |S1| 
        for j := 1 to |S2| 
            if left1[i] ≠ 0 or cross_left1[i] ≠ 0 or 
               left2[j] ≠ 0 or cross_left2[j] ≠ 0 then 
                compute D1(i,j) as in Lemma 5 
            else 
                compute D1(i,j) as in Lemma 6 

Fig. 4. Procedure: Computing $D(R_1[i_1, j_1], R_2[i_2, j_2])$ 
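The table-filling loop of the procedure in Fig. 4 can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: `delete1`, `insert2`, `relabel` and `d2` are hypothetical caller-supplied callables with 1-based indices (`d2(i, j)` stands for $D_2(i,j)$, which the full algorithm obtains from previously processed stems), and the boundary rows, which the text does not spell out, are assumed to be plain sums of deletion or insertion costs.

```python
def fill_d1(m, n, left1, left2, cross_left1, cross_left2,
            cross_weight1, cross_weight2, delete1, insert2, relabel, d2):
    """Fill D1[0..m][0..n] using the case split of Lemmas 5 and 6.

    All array arguments are 1-indexed lists (slot 0 unused).
    """
    D1 = [[0.0] * (n + 1) for _ in range(m + 1)]
    # Assumed boundary: against an empty side, everything is deleted/inserted.
    for i in range(1, m + 1):
        D1[i][0] = D1[i - 1][0] + delete1(i)
    for j in range(1, n + 1):
        D1[0][j] = D1[0][j - 1] + insert2(j)

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            skip_i = D1[i - 1][j] + delete1(i)   # S1[i] not in the mapping
            skip_j = D1[i][j - 1] + insert2(j)   # S2[j] not in the mapping
            if left1[i] or cross_left1[i] or left2[j] or cross_left2[j]:
                # Lemma 5: combine the part to the left with D2(i, j) and
                # pay for deleting every pair that crosses S1[i] or S2[j].
                both = (D1[left1[i]][left2[j]] + d2(i, j)
                        + cross_weight1[i] + cross_weight2[j])
            else:
                # Lemma 6: nothing lies before or crosses either pair.
                both = D1[i - 1][j - 1] + relabel(i, j)
            D1[i][j] = min(skip_i, skip_j, both)
    return D1


if __name__ == "__main__":
    # Degenerate usage: two structures made of single bases only, where the
    # recurrence reduces to ordinary string edit distance with unit costs.
    a, b = "ACGU", "AGU"
    sub = lambda i, j: 0.0 if a[i - 1] == b[j - 1] else 1.0
    table = fill_d1(
        len(a), len(b),
        [0] + list(range(len(a))), [0] + list(range(len(b))),
        [0] * (len(a) + 1), [0] * (len(b) + 1),
        [0.0] * (len(a) + 1), [0.0] * (len(b) + 1),
        lambda i: 1.0, lambda j: 1.0, sub, sub,
    )
    print(table[len(a)][len(b)])
```

In the degenerate usage every element is a single base, so `left[i] = i - 1` and all crossing arrays are zero.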

Let $R_1[1..m]$ and $R_2[1..n]$ be the two given RNA structures. Let $\mathit{stem}(R_1)$ 
and $\mathit{stem}(R_2)$ be the number of stems in $R_1$ and $R_2$ respectively. The time to 
compute $D(R_1[i_1, j_1], R_2[i_2, j_2])$ is bounded by $O(|S(R_1)| \times |S(R_2)|)$. Since $|S(R_1)| \le m$ 
and $|S(R_2)| \le n$, the time complexity of the algorithm is $O(\mathit{stem}(R_1) \times 
\mathit{stem}(R_2) \times m \times n)$. The space complexity of the algorithm is $O(|S(R_1)| \times 
|S(R_2)|) = O(m \times n)$. 

Note that when one of the RNAs is a secondary structure, this algorithm 
computes the optimal solution of the problem. This algorithm can be modified to 
handle the case where the input RNAs are tertiary structures with H-type 
pseudo-knots (a stem crosses with at most one other stem). 

If we represent the secondary structure by a forest, then by using the 
technique of Klein [3] we can compute similarity between a secondary structure and 
a tertiary structure in $O(m^2 n \log n)$ time where $m \le n$. 

Note also that since the number of tertiary interactions is relatively small 
compared with the number of secondary interactions, we can also use this 
algorithm to compute the similarity when both structures are tertiary structures. 
Essentially the algorithm tries to find the best secondary structures to match and 
deletes tertiary interactions. Although this is not an optimal solution, in 
practice it would produce a reasonable result by matching most of the base pairs. A 
post-processing step can be applied to add some matching tertiary interactions 
which satisfy the mapping constraints. 

Input: R1[1..m] and R2[1..n]. 

    compute a sorted (by 3' end) stem list L1 for R1; 
    compute a sorted (by 3' end) stem list L2 for R2; 
    for i := 1 to |L1| 
        for j := 1 to |L2| 
            let L1[i] = (i1, j1, k1) 
            let L2[j] = (i2, j2, k2) 
            compute D(R1[i1, j1], R2[i2, j2]) 
    compute D(R1[1, m], R2[1, n]) 

Fig. 5. An algorithm: Computing $D(R_1, R_2)$ 

5 Approximation Algorithms 

In this section, we consider a maximization version of the problem. Let $M$ be a 
mapping from $R_1$ to $R_2$. The value $\delta(M)$ of $M$ is defined to be the number of 
identical pairs of base-pairs in $M$. Suppose that we define $\gamma(a, b)$ to be 0 if $a$ and 
$b$ are identical; 2 if $a$ and $b$ are non-identical base-pairs; and 1 if one of them is $\lambda$. 
Then $2\delta(M) + \gamma(M) = n_1 + n_2$, where $n_1$ and $n_2$ are the number of base-pairs in 
$S(R_1)$ and $S(R_2)$: each identical matched pair accounts for two base-pairs at cost 0, 
and every other base-pair contributes exactly 1 to $\gamma(M)$. Instead of finding an $M$ 
with the smallest cost $\gamma(M)$, we want to find an $M$ with the largest value $\delta(M)$. 
Obviously, the maximization version is also NP-complete. 

We give a ratio-($b - 1$)+ approximation algorithm for the case where 
each base-pair crosses with at most $b$ other base-pairs. Due to space limitations, 
we only present the basic idea here. 

Our basic idea is as follows: We start with an arbitrary base-pair $(i, j)$ in 
$S(R_1)$ and consider $(i, j)$ and the other at most $b$ base-pairs $(i_1, j_1)$, $(i_2, j_2)$, 
$\ldots$, and $(i_b, j_b)$ crossing $(i, j)$ in $S(R_1)$. Call the $b+1$ base-pairs $(i, j)$, $(i_1, j_1)$, 
$(i_2, j_2)$, $\ldots$, and $(i_b, j_b)$ a $b$-component for $S(R_1)$. We use $(i', j')$, $(i'_1, j'_1)$, $(i'_2, j'_2)$, 
$\ldots$, and $(i'_b, j'_b)$ to denote a $b$-component for $S(R_2)$. For each pair of subsequences 
$R_1[p..q]$ and $R_2[p'..q']$, we consider all pairs of $b$-components for them. A match 
between the two $b$-components contains $k+1$ matched pairs of base-pairs such 
that $(i, j)$ matches $(i', j')$ and the $k+1$ matched pairs of base-pairs satisfy 
(a)-(d) in the definition of a mapping. $(i, j)$ and $(i', j')$ form an imposed pair 
of base-pairs. The $k+1$ base-pairs form $2(k+1)$ positions in both $R_1[p..q]$ and 
$R_2[p'..q']$ that decompose both $R_1[p..q]$ and $R_2[p'..q']$ into $2k+3$ segments, called 
matched segments. For each pair of $b$-components for $R_1[p..q]$ and $R_2[p'..q']$, we 
try all possible matches between the two $b$-components. For each match, we 
forbid any other base-pairs not in the $b$-components to cross any base-pair in 
the $b$-components. The matches between the corresponding matched segments are 
computed recursively. (See Figure 6.) 
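The two building blocks of this idea, collecting a $b$-component and cutting a subsequence into matched segments, can be sketched as below. This is a minimal illustration with hypothetical function names. It assumes that two base-pairs cross when exactly one end of one lies strictly inside the other, and that the paired positions themselves are excluded from the segments; neither detail is fixed by the text above.

```python
from typing import List, Tuple

Pair = Tuple[int, int]


def crosses(p: Pair, q: Pair) -> bool:
    """Assumed definition: the two base-pairs interleave."""
    (a, b), (c, d) = p, q
    return a < c < b < d or c < a < d < b


def b_component(base_pairs: List[Pair], anchor: Pair) -> List[Pair]:
    """The anchor base-pair followed by every base-pair crossing it."""
    return [anchor] + [p for p in base_pairs
                       if p != anchor and crosses(anchor, p)]


def matched_segments(p: int, q: int, kept_pairs: List[Pair]) -> List[Pair]:
    """Cut positions p..q at both ends of each kept base-pair.

    Returns (start, end) for each stretch between consecutive cut positions.
    With k+1 kept pairs there are 2(k+1) cuts and hence 2k+3 segments; a
    segment with start > end is empty.
    """
    cuts = sorted(x for pair in kept_pairs for x in pair)
    bounds = [p - 1] + cuts + [q + 1]
    return [(lo + 1, hi - 1) for lo, hi in zip(bounds, bounds[1:])]


if __name__ == "__main__":
    pairs = [(2, 8), (5, 12), (1, 6), (14, 16)]
    component = b_component(pairs, (2, 8))
    print(component)
    print(matched_segments(1, 16, component))
```

With three kept base-pairs the example yields seven segments, in line with the $2k+3$ count for $k = 2$.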




Fig. 6. (a) the set of specified links for $R_1$. (b) the set of specified links for $R_2$. 
(c) the preserved links for $R_1$ in a match. (d) the preserved links for $R_2$ in a 
match. $(i, j)$ matches $(i', j')$ and $(i_l, j_l)$ matches $(i'_l, j'_l)$ for $l = 1$ and 3. Such a 
match forms 7 matched segments for both $R_1$ and $R_2$. 